NOTE: Run this file in the "info" virtual environment with the "info" kernel

In [38]:
import pandas as pd

df=pd.read_csv('../dataset/original_corpus_Jeng.csv')
df=df.reset_index()  # make sure indexes pair with number of rows

parties_list=[]

df['title']=df['title'].astype('str')

for index, row in df.iterrows():
    parties=[]
    title = row['title']

    if "Democrats" in title:
        parties.append("Democrats")
    if "Republicans" in title:
        parties.append("Republicans")
    if len(parties)>0:
        parties_list.append(parties)

df = df.set_index('id')
df=df.drop(columns=['index'])

# .copy() makes filtered_df an independent DataFrame, so adding a column does not trigger SettingWithCopyWarning
filtered_df = df[df['title'].str.contains('Democrats') | df['title'].str.contains('Republicans')].copy()
filtered_df['parties'] = parties_list
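
The party tagging above can also be written without an explicit row loop. This is a minimal sketch on hypothetical toy titles, not the real corpus; the names `toy`, `PARTIES`, and `tagged` are assumptions made for illustration.

```python
import pandas as pd

# Toy titles standing in for the real corpus (hypothetical data)
toy = pd.DataFrame({
    "title": [
        "Democrats propose new budget",
        "Weather update for the weekend",
        "Republicans and Democrats reach a deal",
        "Republicans hold primary debate",
    ]
})

PARTIES = ["Democrats", "Republicans"]

# For each title, list the party names it mentions (in the order of PARTIES)
toy["parties"] = toy["title"].apply(
    lambda title: [party for party in PARTIES if party in title]
)

# Keep only titles that mention at least one party; .copy() makes the
# result an independent DataFrame that is safe to modify later
tagged = toy[toy["parties"].apply(len) > 0].copy()

print(tagged["parties"].tolist())
# [['Democrats'], ['Democrats', 'Republicans'], ['Republicans']]
```

Building the list column before filtering keeps each list attached to its own row, so there is no separate list that has to stay aligned with the filtered dataframe.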


We want to add a dataframe of articles whose titles don't contain any political party names:
- Take a random sample of 1000 articles
- Keep only the articles whose titles don't contain any party names
We do this to balance the dataset (≈1000 documents with "Democrats", ≈1000 with "Republicans", and ≈1000 without any party).

In [39]:
noParties_df1000 = df.sample(n=1000)

flag_list=[]
for index, r in noParties_df1000.iterrows():
    t=r['title']

    # "f" if the title names a party, "t" otherwise
    flag="f" if ("Democrats" in t) or ("Republicans" in t) else "t"
    flag_list.append(flag)

Keep only the articles whose titles don't contain party names (i.e. flag="t")

In [40]:
noParties_df1000['flag'] = flag_list
noParties_df1000=noParties_df1000[noParties_df1000['flag']=="t"]

noParties_df1000=noParties_df1000.drop(columns=['flag']) 
noParties_df1000.shape

(958, 13)
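
Because the filter runs after sampling, fewer than 1000 rows remain (958 here). A possible alternative, sketched on hypothetical toy data, is to filter first and sample afterwards so that the sample size is exact. The `random_state` argument is an assumption added for reproducibility and is not part of the original cells.

```python
import pandas as pd

# Hypothetical toy corpus: 6 titles, 2 of which name a party
toy = pd.DataFrame({
    "title": [
        "Democrats propose new budget",
        "Weather update for the weekend",
        "Local team wins the final",
        "Republicans hold primary debate",
        "New museum opens downtown",
        "Markets close higher",
    ]
})

# Boolean mask: True where the title names either party
has_party = toy["title"].str.contains("Democrats|Republicans")

# Filter first, then sample, so exactly n rows come back
n = 3  # stands in for 1000
no_party_sample = toy[~has_party].sample(n=n, random_state=0)

print(no_party_sample.shape)  # (3, 1)
```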

In [41]:
#add a parties column so this dataframe has the same columns as filtered_df
noParties_df1000['parties'] = "[]"

final_df=pd.concat([filtered_df,noParties_df1000]) #concatenate both dataframes
final_df=final_df[final_df['category']=='Politics'] #keep only articles in the Politics category

final_df['parties'].value_counts()
# final_df.head()

[Republicans]               1005
[]                           958
[Democrats]                  913
[Democrats, Republicans]      88
Name: parties, dtype: int64
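
Note that the rows without parties hold the string "[]" while the other rows hold Python lists, so the column mixes two types. A small sketch, on toy data, of turning every entry into a plain string label before counting; `to_label` and the label "none" are hypothetical choices, not part of the notebook.

```python
import pandas as pd

# Toy column mixing lists and the "[]" placeholder string (hypothetical data)
toy = pd.DataFrame({
    "parties": [["Democrats"], ["Republicans"], "[]",
                ["Democrats", "Republicans"], ["Republicans"]]
})

def to_label(value):
    """Join a list of party names into one string; map the "[]" placeholder to "none"."""
    if isinstance(value, list):
        return ", ".join(value) if value else "none"
    return "none" if value == "[]" else str(value)

# Strings are hashable, so value_counts works on them without surprises
labels = toy["parties"].apply(to_label)
counts = labels.value_counts()
print(counts)
```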

Now we add named entities

In [52]:
##########################
#These two lines are very important to run spaCy without conflicts
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
########################

import spacy
from spacy.lang.en.examples import sentences 
nlp = spacy.load("en_core_web_trf" ,disable=["parser"])  #English model

Function to extract named entities from text.
We extract named entities such as persons, organizations, and events, and ignore the other entity types such as money, dates, or miscellaneous ones.

In [43]:
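def generate_NE(text):
    """Return the texts of the named entities in `text` whose label is in `labels`."""

    labels = ['PERSON','NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT',
              'EVENT', 'WORK_OF_ART', 'LAW']

    ne_list = []

    #Extract NEs (nlp(text) processes one string and returns a Doc)
    doc = nlp(text)

    #Select only entities in the list of labels and ignore the other types of entities
    for ent in doc.ents:
        if ent.label_ in labels:
            #Store the entity text; strings are hashable, so duplicates can be removed later
            ne_list.append(ent.text)
    return ne_list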
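
The label filter can be checked without loading a spaCy model. The sketch below applies the same selection rule to plain (text, label) pairs; `filter_entities`, `KEEP_LABELS`, and the sample pairs are hypothetical and only illustrate the logic.

```python
# Labels to keep, stored as a set for fast membership tests
KEEP_LABELS = {'PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT',
               'EVENT', 'WORK_OF_ART', 'LAW'}

def filter_entities(pairs, keep_labels=KEEP_LABELS):
    """Return the texts of the (text, label) pairs whose label is kept."""
    return [text for text, label in pairs if label in keep_labels]

# Hand-written pairs standing in for spaCy output (hypothetical data)
sample_pairs = [
    ("Obama", "PERSON"),
    ("$5 million", "MONEY"),
    ("Congress", "ORG"),
    ("Tuesday", "DATE"),
    ("Ohio", "GPE"),
]

print(filter_entities(sample_pairs))  # ['Obama', 'Congress', 'Ohio']
```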

In [None]:
final_df=final_df.reset_index()  # make sure indexes pair with number of rows

NE_list=[]

for index, row in final_df.iterrows():
    ne=generate_NE(row['text']) 
    NE_list.append(ne)
    print(row['id'])
    
final_df['NE']=NE_list

I noticed that the NE lists contain duplicates, so I wrote this function to remove them

In [63]:
def remove_duplicates(mlist):
    return list(dict.fromkeys(mlist))

final_df['NE'] = final_df['NE'].apply(remove_duplicates)
final_df['NE'].head(5)

0    [CAMBRIDGE, Md., Obama, Biden, Democrats, Repu...
1    [Republicans, Obama, House Budget Committee, P...
2    [GOP, Super Tuesday, Congress, Ohio, Republica...
3    [Illinois, Democrats, Republican, House, Chica...
4    [House, Republican, Democrats, Rosalind Helder...
Name: NE, dtype: object
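
`dict.fromkeys` keeps the first occurrence of each item in its original order, but it needs hashable items, so it works for strings and not for dicts. Below is a sketch of an order-preserving variant for entries that carry their label as a one-item dict; `remove_duplicate_dicts` is a hypothetical helper, not part of the notebook.

```python
def remove_duplicate_dicts(entries):
    """Drop repeated dicts while keeping first-seen order."""
    seen = set()
    unique = []
    for entry in entries:
        key = tuple(sorted(entry.items()))  # hashable stand-in for the dict
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique

# Strings: dict.fromkeys is enough
print(list(dict.fromkeys(["Obama", "Ohio", "Obama"])))  # ['Obama', 'Ohio']

# Dicts: use the helper instead
entries = [{"Obama": "PERSON"}, {"Ohio": "GPE"}, {"Obama": "PERSON"}]
print(remove_duplicate_dicts(entries))  # [{'Obama': 'PERSON'}, {'Ohio': 'GPE'}]
```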

In [65]:
final_df.to_csv("../dataset/corpus_with_parties_NE_24.10.2022.csv", encoding='utf-8')