# Cleaning Data

In [23]:
#Importing necessary library
import pandas as pd # for handling data

In [24]:
ls     # Viewing files to avoid typo

 Volume in drive C is New Volume
 Volume Serial Number is 24F0-9997

 Directory of C:\Project\practice\Projects\Real-time Twitter Survey Analysis on US Election 2020

08/23/2020  05:02 AM    <DIR>          .
08/23/2020  05:02 AM    <DIR>          ..
08/23/2020  03:23 AM    <DIR>          .ipynb_checkpoints
08/23/2020  03:20 AM            11,813 1. Collecting Data using Twitter API.ipynb
08/23/2020  05:02 AM            23,513 2. Cleaning Data.ipynb
08/23/2020  03:01 AM           310,730 biden_tweets.csv
08/23/2020  03:05 AM           322,930 democrat_tweets.csv
08/23/2020  03:05 AM            63,840 kamala_tweets.csv
08/23/2020  03:07 AM            63,043 mikepence_tweets.csv
08/23/2020  03:07 AM           325,784 republican_tweets.csv
08/23/2020  03:07 AM           308,381 trump_tweets.csv
               8 File(s)      1,430,034 bytes
               3 Dir(s)  243,494,387,712 bytes free


### Import & organize files

In [25]:
# Making single dataframe for republican

trump = pd.read_csv('trump_tweets.csv')
mike = pd.read_csv('mikepence_tweets.csv')
republican = pd.read_csv('republican_tweets.csv')

republican_data=pd.concat([trump,mike,republican], ignore_index=True) # Concating three dataframes
republican_data.sample(10)   # sample of combined dataframe

Unnamed: 0,Tweets
3832,@politvidchannel One more Republican dumps trump!
2530,"Democratic plan in rural, swing state counties..."
3800,@dplaz19761 I see is as:\n\nTodays Republican ...
3741,@Acosta It is false but we should definitely t...
2898,"Quoting @mikepence ""500,000 new manufacturing ..."
3815,"@marcorubio Over the last 3.5 years, over 50 R..."
2331,@tkbuckels @BlairMarnell @JennaEllisEsq I see ...
5210,As a poor Democratic leaning person I am deepl...
1668,"Those who have lost loved ones to Covid, trump..."
35,Trump doesn’t believe in god he thinks he is t...


In [26]:
# Making single dataframe for democrat

biden = pd.read_csv('biden_tweets.csv')
kamala = pd.read_csv('kamala_tweets.csv')
democrat = pd.read_csv('democrat_tweets.csv')

democrat_data=pd.concat([biden,kamala,democrat], ignore_index=True)
democrat_data.sample(10)

Unnamed: 0,Tweets
965,@AriBerman Trump's puppet DeJoy placed as Post...
2614,@BeComfy Kamala married one. \n\nNever mind......
2035,"@QuinnLisaq @CNNPolitics Simply, what do you t..."
2874,"Here's my hot take, kamala harris was deadly i..."
3609,@Jim_Jordan Another Democrat scam straight fro...
2564,@SarahMcord @DogfromTheThing @Jourd_ @davidsir...
4528,Democrat Black Lives Matter Leader: We Could '...
565,@EqualTx @MaryCal18844902 @overitall69 @PipiPu...
1113,"""What Have Democrats Done To Solve ANYTHING?"":..."
359,A big reason to not support Joe Biden/Kamala H...


##### We created corpus for democrat and republican party tweets

# Data Cleaning & Preprocessing
**Common data cleaning steps on all text:**

- Make text all lower case
- Remove punctuation
- Remove numerical values
- Remove common non-sensical text (/n)
- Tokenize text
- Remove stop words

**More data cleaning steps after tokenization:**

- Stemming / lemmatization
- Parts of speech tagging
- Create bi-grams or tri-grams
- Deal with typos
- And more...

In [27]:
republican_data.shape   # we have 5500 rows of tweets for republican

(5500, 1)

In [28]:
democrat_data.shape    # we have 5500 rows of tweets for democrat

(5500, 1)

**Removing duplicate rows**

In [29]:
republican_data.drop_duplicates(subset=None, keep='first', inplace=True) # removing duplicate rows

In [30]:
republican_data.shape

(5446, 1)

In [31]:
democrat_data.drop_duplicates(subset=None, keep='first', inplace=True)

In [32]:
democrat_data.shape

(5449, 1)

**Handling Na/Null values**

In [33]:
republican_data.isna().any()

Tweets    False
dtype: bool

In [34]:
democrat_data.isna().any()

Tweets    False
dtype: bool

**Making lowercase, removing punctuation, numerical values & so on..**

In [35]:
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub('\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

In [36]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    text = re.sub(r"^https://t.co/[a-zA-Z0-9]*\s"," ",text)
    text = re.sub(r"\s+https://t.co/[a-zA-Z0-9]\s+"," ",text)
    text = re.sub(r"\s+http://t.co/[a-zA-Z0-9]*$"," ",text)
    text=re.sub(r"rt"," ",text)
    text=re.sub(r'@[^\s]+',' ',text)
    text=re.sub(r"that's","that is",text)
    text=re.sub(r"there's","there is",text)
    text=re.sub(r"what's","what is",text)
    text=re.sub(r"it's","it is",text)
    text=re.sub(r"who's","who is",text)
    text=re.sub(r"i'm","i am",text)
    text=re.sub(r"she's","she is",text)
    text=re.sub(r"he's","he is",text)
    text=re.sub(r"they're","they are",text)
    text=re.sub(r"who're","who are",text)
    text=re.sub(r"ain't","am not",text)
    text=re.sub(r"don't","do not",text)
    text=re.sub(r"doesn't","does not",text)
    text=re.sub(r"didn't","did not",text)
    text=re.sub(r"wouldn't","would not",text)
    text=re.sub(r"shouldn't","should not",text)
    text=re.sub(r"can't","can not",text)
    text=re.sub(r"isn't","is not",text)
    text=re.sub(r"it's","it is not",text)
    text=re.sub(r"isn't","is not",text)
    text=re.sub(r"wasn't","was not",text)
    text=re.sub(r"weren't","were not",text)
    text=re.sub(r"couldn't","could not",text)
    text=re.sub(r"won't","will not",text)
    text=re.sub(r"\W"," ",text)
    text=re.sub(r"\d"," ",text)
    text=re.sub(r"\s+[a-zA-Z]\s+"," ",text)
    text=re.sub(r"\s+[a-zA-Z]$"," ",text)
    text=re.sub(r"^[a-z]\s+"," ",text)
    text=re.sub(r"https"," ",text)
    text=re.sub(r"http\s+","",text)
    text=re.sub(r"yifmqy"," ",text)
    text=re.sub(r"\s+"," ",text)
    text=text.strip('\'"')
    return text

round2 = lambda x: clean_text_round2(x)

In [None]:
#invoke garbage collector to free ram
import gc
gc.collect()

In [37]:
clean_democrat_df = pd.DataFrame(democrat_data['Tweets'].apply(round1)) # accessing function using apply method
clean_democrat_df = pd.DataFrame(democrat_data['Tweets'].apply(round2))

clean_republican_df = pd.DataFrame(republican_data['Tweets'].apply(round1))
clean_republican_df = pd.DataFrame(republican_data['Tweets'].apply(round2))

del democrat_data,republican_data,trump,mike,republican,biden,kamala,democrat # Deleting unecessary dataframes

In [38]:
clean_democrat_df.sample(10)   # Let's take a look at our clean dataset :) 

Unnamed: 0,Tweets
276,You could be helpful instead of condescending...
3414,This is the democrat hierarchy Theyre special ...
5495,Not President Trump The Democrat thieves are ...
2183,Rose McGowan got it right co cphBnYbC via News
1264,OMG are they all delusional Drinking from tha...
1628,Omg Please stay safe Sending you all the rain...
1247,do not criticize Biden until we can safely ig...
851,The Biden campaign did something right co sJ m...
3663,It was close Not one black Democrat said they...
4123,Right Thats why he is leading the quickest fi...


In [39]:
clean_republican_df.sample(10)

Unnamed: 0,Tweets
1272,Harry Truman famously said THE BUCK STOPS HERE...
3903,No Democrat is surprised by this and no Repub...
1505,Mr Trump stop finger pointing all the time an...
657,USPS delivers no matter where amp for who USP...
5071,Folks have you seen the latest Republican ghou...
1344,Bawahahaha Trump never stayed overnight in Rus...
3095,Does she even realize people are starving in ...
225,Ladies and Gentlemen heres Vile Hateful Trump ...
295,Senate Russia repo proves Trump collusion was ...
974,NEED PROTEST OUTSIDE OF EVERY TRUMP BUSINESS


**Removing stopwords using NLTK corpus**

In [40]:
from nltk.corpus import stopwords
stop = stopwords.words('english')  # since our dataset is in English language

In [41]:
# Removing Stopword after making lowercase and split
clean_democrat_df = clean_democrat_df['Tweets'].str.lower().str.split().apply(lambda x: [item for item in x if item not in stop])
clean_republican_df = clean_republican_df['Tweets'].str.lower().str.split().apply(lambda x: [item for item in x if item not in stop])

In [44]:
clean_democrat_df.sample(5) # Looks like python list -_- Let's try to fix this

2633    [kamala, harris, admitted, radical, running, c...
5071    [democratic, pa, betrayed, people, baltimore, ...
578                       [china, paid, biden, look, way]
4619      [except, tweet, retweeted, evid, co, tsn, mhvf]
2385    [im, black, doesnt, mean, vote, democrat, im, ...
Name: Tweets, dtype: object

In [52]:
clean_democrat=pd.DataFrame(clean_democrat_df)
clean_democrat.sample(5)

Unnamed: 0,Tweets
4960,"[sc, democrat, mayoral, candidate, staged, kid..."
2375,"[woman, come, forward, glaring, lights, focus,..."
4390,"[happening, democrat, controlled, areas, feder..."
2460,"[biden, harris, may, forge, new, path, educati..."
3670,"[fake, hate, black, democrat, faked, filmed, b..."


In [55]:
clean_republican=pd.DataFrame(clean_republican_df)

Whooo! Our dataset is clean now. :)

**Saving cleaned Data**

In [54]:
clean_democrat.to_csv('clean_democrat.csv',encoding='utf-8', index=False)
clean_republican.to_csv('clean_republican.csv',encoding='utf-8', index=False)

In [56]:
#invoke garbage collector to free ram
import gc
gc.collect()

142