<a id=contents></a>

# Cleaning notebook
## Subtitle


[1. Data Inspection](#insp)

[2. Using Spark and Hadoop](#numerical)

[3. Cleaning categorical data](#categ)

[4. Cleaning text data](#text)

In [22]:
%load_ext autoreload
%autoreload 2 

import pandas as pd
import numpy as np
from pathlib import Path

import functions.functions as fn

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<a id=insp></a>

## 1. Data Inspection
    
[LINK to table of contents](#contents)

In [23]:
# 4-5 min to run
raw_path = Path('data/raw')
clean, cvt = fn.clean_tweet_text(raw_path)

In [24]:
clean.head()

Unnamed: 0.1,Unnamed: 0,tweet_id,datetime,display_name,tweet_text,User_id,extracted_twitter_handles,extracted_URLs,extracted_hashtags,clean_tweet_text,...,woman,work,working,world,would,yeah,year,yes,yet,youre
0,0,1579299680491823104,2022-10-10 02:36:00+00:00,ElonMusk,@CathieDWood 💯,44196397.0,[@CathieDWood],[],[],💯,...,0,0,0,0,0,0,0,0,0,0
1,1,1579263949945835521,2022-10-10 00:14:01+00:00,ElonMusk,@WholeMarsBlog It will,44196397.0,[@WholeMarsBlog],[],[],It will,...,0,0,0,0,0,0,0,0,0,0
2,2,1579262341904203777,2022-10-10 00:07:38+00:00,ElonMusk,@chicago_glenn @neiltyson Strange @Twitter,44196397.0,"[@chicago_glenn, @neiltyson, @Twitter]",[],[],Strange,...,0,0,0,0,0,0,0,0,0,0
3,3,1579258883977416705,2022-10-09 23:53:53+00:00,ElonMusk,@WholeMarsBlog I have no desire to become invo...,44196397.0,[@WholeMarsBlog],[],[],I have no desire to become involved in wars bu...,...,0,0,0,0,0,0,0,0,0,0
4,4,1579235577492574209,2022-10-09 22:21:17+00:00,ElonMusk,@Teslarati Strong candidate to win most counte...,44196397.0,[@Teslarati],[],[],Strong candidate to win most counterintuitive ...,...,0,0,0,0,0,0,1,0,0,0


In [25]:
clean.to_csv('data/clean/clean_data.csv')

In [26]:
# Select only the columns used for classification, i.e. the vectorized (count) features
cols = list(clean.columns)
start_cvt_feats = cols.index('clean_tweet_text') + 1
cvt_cols = clean.columns[start_cvt_feats:]
cvt_cols = cvt_cols.insert(0, 'tweet_id')
feats_df = clean[cvt_cols]
feats_df.to_csv('data/clean/features/count_vect_feats.csv')
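The positional slice above can also be written with `Index.get_loc`, which avoids converting the columns to a list first. This is a minimal sketch on a toy frame, assuming (as in `clean`) that every column to the right of `clean_tweet_text` is a count-vectorizer feature. The toy column names and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `clean`: metadata columns first, then vectorized word counts.
toy = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "clean_tweet_text": ["work work", "yes"],
    "work": [2, 0],
    "yes": [0, 1],
})

# Position of the first feature column: one past `clean_tweet_text`.
start = toy.columns.get_loc("clean_tweet_text") + 1

# Keep `tweet_id` as the key, followed by every feature column.
feature_cols = ["tweet_id", *toy.columns[start:]]
toy_feats = toy[feature_cols]
```

Selecting by the marker column's label rather than a hard-coded position keeps the slice valid if metadata columns are added or removed later.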

In [27]:
clean_display = clean[cols[1:start_cvt_feats]]
clean_display.to_csv('data/clean/clean_display_data.csv')
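The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above are what pandas produces when a frame's index is written by `to_csv` and then read back as an ordinary column. Here is a small sketch of that round trip, using an in-memory buffer in place of the files under `data/clean`. Whether dropping the index is right for this pipeline is an assumption, since later cells select columns by position.

```python
import io
import pandas as pd

toy = pd.DataFrame({"tweet_id": ["1", "2"], "clean_tweet_text": ["a", "b"]})

# Default behaviour: the index is written as an unnamed first column...
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
# ...and comes back as a regular column called "Unnamed: 0".
reread_default = pd.read_csv(with_index)

# Writing without the index keeps the column set unchanged.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)

print(list(reread_default.columns))
print(list(reread_clean.columns))
```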

In [28]:
clean.head(1)

Unnamed: 0.1,Unnamed: 0,tweet_id,datetime,display_name,tweet_text,User_id,extracted_twitter_handles,extracted_URLs,extracted_hashtags,clean_tweet_text,...,woman,work,working,world,would,yeah,year,yes,yet,youre
0,0,1579299680491823104,2022-10-10 02:36:00+00:00,ElonMusk,@CathieDWood 💯,44196397.0,[@CathieDWood],[],[],💯,...,0,0,0,0,0,0,0,0,0,0


In [30]:
import functions.nlp_eda as nlp

In [31]:
# Also compute the tf-idf representation, since it gives a different way
# to assess the text and words analytically

tfdf = nlp.get_tfidf_df(clean_display, 'clean_tweet_text')

In [36]:
tfdf.iloc[:, start_cvt_feats-1:].to_csv('data/clean/features/tf_idf_df.csv')
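For reference, tf-idf weights a word by how often it occurs in one tweet and discounts it by how many tweets contain it. The sketch below computes the plain textbook form, tf × log(N / df), on three made-up strings. It is not the implementation behind `nlp.get_tfidf_df`, and library implementations such as scikit-learn's `TfidfVectorizer` use smoothed and normalised variants, so the numbers will differ.

```python
import math
from collections import Counter

docs = ["ukraine peace talks", "peace peace now", "tesla stock now"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return {word: tf * log(N / df)} for one tokenized document."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = [tfidf(tokens) for tokens in tokenized]
```

In the first toy document, `ukraine` appears in only one document and so gets a higher weight than `peace`, which appears in two.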

<a id=numerical></a>

## 2. Using Spark and Hadoop
    
[LINK to table of contents](#contents)

After exploring the dataset I'd gathered, I realised that my data processing was taking far too long and needed to be streamlined, so I decided to switch over to Spark and Hadoop's HDFS.

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
import findspark 


In [6]:
import pyspark
pyspark.find_spark_home

<module 'pyspark.find_spark_home' from '/Users/ipreoteasa/opt/anaconda3/envs/dev/lib/python3.10/site-packages/pyspark/find_spark_home.py'>

In [9]:
findspark.init('/Users/ipreoteasa/opt/anaconda3/envs/dev/lib/python3.10/site-packages/pyspark/') 

In [10]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('twitter_sentiment_tracker').master('local[*]').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/10/11 12:21:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [12]:
df = pd.read_csv('data/clean/clean_display_data.csv')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676653 entries, 0 to 676652
Data columns (total 10 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Unnamed: 0                 676653 non-null  object 
 1   tweet_id                   676642 non-null  float64
 2   datetime                   676642 non-null  object 
 3   display_name               676642 non-null  object 
 4   tweet_text                 676642 non-null  object 
 5   User_id                    676342 non-null  object 
 6   extracted_twitter_handles  559523 non-null  object 
 7   extracted_URLs             559523 non-null  object 
 8   extracted_hashtags         559523 non-null  object 
 9   clean_tweet_text           538344 non-null  object 
dtypes: float64(1), object(9)
memory usage: 51.6+ MB
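The summary above shows `tweet_id` stored as `float64`, which cannot represent 19-digit IDs exactly (this is why they print as `1.576969e+18` further down). One possible remedy, sketched here on an in-memory CSV rather than the real file, is to read the ID column as a string. Whether the rest of the pipeline expects numeric IDs is not known from the notebook.

```python
import io
import pandas as pd

# One ID taken from the output above, plus a made-up row with a missing ID,
# which is what forces pandas to fall back to float64.
csv_text = "tweet_id,display_name\n1579263949945835521,ElonMusk\n,ElonMusk\n"

as_float = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"tweet_id": str})

original = 1579263949945835521
# The float column rounds the ID to the nearest representable value.
print(int(as_float["tweet_id"].iloc[0]) == original)
# The string column keeps every digit.
print(as_text["tweet_id"].iloc[0] == str(original))
```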


In [22]:
pd.to_datetime('2022-10-10 16:15:43+00:00') >= pd.to_datetime('2022-10-03 16:15:43+00:00')

True

In [19]:
non_null = df.dropna()
non_null.loc[non_null.clean_tweet_text.str.startswith('Ukraine')]

Unnamed: 0.1,Unnamed: 0,tweet_id,datetime,display_name,tweet_text,User_id,extracted_twitter_handles,extracted_URLs,extracted_hashtags,clean_tweet_text
113,113,1.576969e+18,2022-10-03 16:15:43+00:00,ElonMusk,Ukraine-Russia Peace:\n\n- Redo elections of a...,44196397.0,[],[],[],UkraineRussia Peace Redo elections of annexed ...
384,384,1.578662e+18,2022-10-08 08:20:12+00:00,KonstantinKisin,Ukraine asked for missiles it could use to dis...,1495726466.0,[],['https://t.co/1d8Fv0vEU2'],[],Ukraine asked for missiles it could use to dis...
1811,1811,1.579156e+18,2022-10-09 17:06:00+00:00,HuXijin_GT,Ukraine changing its mind is a sign of being b...,2775998016.0,[],['https://t.co/wyC0YNEYyM'],[],Ukraine changing its mind is a sign of being b...
1824,1824,1.576980e+18,2022-10-03 16:58:11+00:00,HuXijin_GT,Ukraine is fighting for NATO in a sense. NATO ...,2775998016.0,[],['https://t.co/ZhgKQT3h70'],[],Ukraine is fighting for NATO in a sense NATO h...
1831,1831,1.575737e+18,2022-09-30 06:39:28+00:00,HuXijin_GT,Ukraine’s situation is at an extremely dangero...,2775998016.0,[],['https://t.co/gpH7C7U04m'],[],Ukraines situation is at an extremely dangerou...
...,...,...,...,...,...,...,...,...,...,...
676286,517349,1.578282e+18,2022-10-07 07:12:22+00:00,ukraine_world,Ukraine’s rapid breakthrough in the Kharkiv Ob...,873135988440223744.0,[],['https://t.co/Fsozl9xAap'],['#UkraineWillWin'],Ukraines rapid breakthrough in the Kharkiv Obl...
676296,517359,1.578053e+18,2022-10-06 16:01:32+00:00,ukraine_world,Ukraine's President Volodymyr Zelenskyy says h...,873135988440223744.0,['@YouTube'],['https://t.co/7aFkWG8h0X'],[],Ukraines President Volodymyr Zelenskyy says he...
676320,517383,1.577987e+18,2022-10-06 11:41:51+00:00,ukraine_world,Ukraine recaptured more than 6000 sq. km of te...,873135988440223744.0,['@YouTube'],['https://t.co/Z9D8gSNfgN'],['#UkraineWillWin'],Ukraine recaptured more than 6000 sq km of ter...
676326,517389,1.577980e+18,2022-10-06 11:13:59+00:00,ukraine_world,Ukraine’s President Zelenskyy has urged world ...,873135988440223744.0,['@guardian'],['https://t.co/eV2mxdwxKY'],"['#StandWithIUkraine', '#StopRussiaNOW']",Ukraines President Zelenskyy has urged world l...
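`dropna()` without arguments discards every row that has a missing value in any column, so tweets lacking, say, a `User_id` are excluded from the search. A hedged alternative, shown on a toy frame with made-up rows, is to let `str.startswith` treat missing text as a non-match via `na=False`.

```python
import pandas as pd

toy = pd.DataFrame({
    "User_id": [None, "2", "3"],
    "clean_tweet_text": ["Ukraine asked for help", None, "Tesla update"],
})

# Missing text counts as "does not start with 'Ukraine'"; other columns are ignored.
mask = toy["clean_tweet_text"].str.startswith("Ukraine", na=False)
matches = toy.loc[mask]

# For comparison: dropna() first loses the matching row, because its User_id is missing.
matches_after_dropna = toy.dropna().loc[
    lambda d: d["clean_tweet_text"].str.startswith("Ukraine")
]
```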


In [16]:
df['datetime'] = pd.to_datetime(df['datetime'])

# Simple before/after split around ElonMusk's tweet proposing elections in Ukraine
cutoff_date = pd.to_datetime('2022-10-03 16:15:43+00:00')

# Add a filter column stating whether each tweet falls before or after that tweet
df['Before_or_after_controversy'] = df['datetime'].apply(lambda x: fn.is_it_before_or_after(x, cutoff_date))
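`fn.is_it_before_or_after` lives in the project's own `functions` module and is not shown here. As a rough equivalent, the same flag can be computed without `apply` by comparing the whole column at once. The labels `'before'` and `'after'`, and the choice to count the cutoff itself as `'after'`, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

cutoff_date = pd.to_datetime("2022-10-03 16:15:43+00:00")

# Two timestamps taken from the output above, one on each side of the cutoff.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2022-09-30 06:39:28+00:00", "2022-10-08 08:20:12+00:00"]
    )
})

# Vectorized comparison: one boolean per row, mapped to a label.
toy["Before_or_after_controversy"] = np.where(
    toy["datetime"] >= cutoff_date, "after", "before"
)
```

On a frame of several hundred thousand rows, a column-wise comparison like this usually runs much faster than a Python-level `apply`.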




In [13]:
df['datetime'] = pd.to_datetime(df['datetime'])


ParserError: Unknown string format: []
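The error suggests that some rows hold a non-date value such as `[]` in the `datetime` column, possibly because fields were misaligned in those rows. One way to get past it, sketched on made-up values, is `errors='coerce'`, which turns unparseable entries into `NaT` so they can be inspected or dropped. This treats the symptom and does not repair the affected rows.

```python
import pandas as pd

raw = pd.Series(["2022-10-03 16:15:43+00:00", "[]", None])

# Unparseable strings and missing values both become NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# Rows that had a value but failed to parse, for inspection before deciding what to do.
bad_rows = raw[parsed.isna() & raw.notna()]
```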

<a id=categ></a>

## 3. Cleaning categorical data
   
[LINK to table of contents](#contents)

<a id=text></a>

## 4. Cleaning text data
    
[LINK to table of contents](#contents)