# Twitter Dataset

### Goal 

Our classification models perform well enough that we want to challenge them with a different kind of dataset: [tweets](https://www.kaggle.com/kazanova/sentiment140)!

In [115]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split

In [116]:
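# The Sentiment140 file has no header row, so the column names are supplied here.
# "untrucbizarre" (French for "a weird thing") holds the query flag, e.g. NO_QUERY.
data= pd.read_csv(r"C:\Users\wenceslas\Documents\cours\ENSAE\2A\Normal\statapp\nlp_understanding\data\01_raw\test_twitter\twitter_data.csv"
                 , encoding= "latin-1"
                 , names= ["label", "id", "date", "untrucbizarre", "name", "reviews"])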

In [117]:
data.head()

| | label | id | date | untrucbizarre | name | reviews |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [118]:
data["label"].value_counts()

4    800000
0    800000
Name: label, dtype: int64

In [119]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   label          1600000 non-null  int64 
 1   id             1600000 non-null  int64 
 2   date           1600000 non-null  object
 3   untrucbizarre  1600000 non-null  object
 4   name           1600000 non-null  object
 5   reviews        1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [120]:
data[data["label"] == 0].head()["reviews"].values[1]

"is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"

In [121]:
data[data["label"] == 4].tail()["reviews"].values[0]

'Just woke up. Having no school is the best feeling ever '

In [122]:
data.drop(columns= ["id", "date", "untrucbizarre", "name"], inplace= True)

data.columns

Index(['label', 'reviews'], dtype='object')

## Clean tweets

In [123]:
# Copied from https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/11332580
remove_link= lambda text: re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", text)

In [124]:
data_cleaned= data.copy()

In [125]:
data_cleaned["reviews"]= data_cleaned["reviews"].apply(remove_link)

In [126]:
data_cleaned.head()

| | label | reviews |
|---|---|---|
| 0 | 0 | @switchfoot - Awww, that's a bummer. You sh... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


In [127]:
# Strip @mentions ("arobase" is French for the @ sign).
remove_arobase= lambda text: re.sub(r"(@\w*)", "", text)

In [128]:
data_cleaned["reviews"]= data_cleaned["reviews"].apply(remove_arobase)
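
The effect of the two cleaning steps can be checked on a single tweet. The sketch below is illustrative only: `URL_PATTERN` is a deliberately simplified stand-in for the Stack Overflow regex above, and `clean_tweet` and its final whitespace collapse are assumptions that are not part of the pipeline in this notebook.

```python
import re

# Simplified stand-in for the URL regex used above (assumption: only http(s) links).
URL_PATTERN = re.compile(r"https?://\S+")
# Same pattern as remove_arobase: an "@" followed by any word characters.
MENTION_PATTERN = re.compile(r"@\w*")


def clean_tweet(text: str) -> str:
    """Hypothetical helper: drop links and @mentions, then tidy whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub("", text)
    # Extra step, not in the notebook: collapse the gaps left by the removals.
    return " ".join(text.split())


sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
print(clean_tweet(sample))
```

Note that `@\w*` matches anywhere in the text, so it would also eat part of an e-mail address (`me@example.com` becomes `me.com`). For tweets this is probably acceptable, but it is worth keeping in mind.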

In [129]:
# Shuffle the rows; without a random_state the order changes on every run.
data_cleaned= data_cleaned.sample(frac=1)

In [130]:
# Recode the positive label from 4 to 1 so labels are binary (0 = negative, 1 = positive)
data_cleaned["label"]= data_cleaned["label"].apply(lambda x: 1 if x == 4 else x)

In [131]:
data_cleaned.tail()

| | label | reviews |
|---|---|---|
| 102606 | 0 | Struggling with a nasty bug in PHP with '#' in... |
| 1407366 | 1 | The Red Rose Society has a weekly meeting wher... |
| 1334286 | 1 | Yay! Ill race you lol |
| 434838 | 0 | How is an ex stripper/freak who puts her busi... |
| 531014 | 0 | Dhoni run out of a wide and IND faltering big ... |


In [132]:
# Hold out 10% of the data as the test set.
X_train, X_test, y_train, y_test= train_test_split(data_cleaned["reviews"], data_cleaned["label"], test_size=0.10, random_state=1)

In [133]:
# Hold out 10% of the remaining rows (9% of the total) for validation.
X_train, X_valid, y_train, y_valid= train_test_split(X_train, y_train, test_size=0.10, random_state=1)

In [134]:
print(X_train.shape)
print(X_test.shape)
print(X_valid.shape)

assert X_train.shape == y_train.shape
assert X_test.shape == y_test.shape
assert X_valid.shape == y_valid.shape

(1296000,)
(160000,)
(144000,)
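
Because the second split is applied to what remains after the first, the validation set is 10% of 90%, that is 9% of the full dataset. A quick arithmetic sketch reproduces the shapes printed above. It uses integer percentages and ignores how scikit-learn rounds fractional sizes, and the variable names are illustrative.

```python
n_total = 1_600_000  # rows in the dataset, from data.info()

# First split: 10% of everything goes to the test set.
n_test = n_total * 10 // 100
n_remaining = n_total - n_test

# Second split: 10% of the remainder goes to the validation set.
n_valid = n_remaining * 10 // 100
n_train = n_remaining - n_valid

print(n_train, n_test, n_valid)
for name, n in [("train", n_train), ("test", n_test), ("valid", n_valid)]:
    print(f"{name}: {n / n_total:.0%}")
```

The resulting proportions are 81% train, 10% test and 9% validation.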


## Save

In [135]:
train= pd.DataFrame({"review": X_train, "label": y_train})
test= pd.DataFrame({"review": X_test, "label": y_test})
valid= pd.DataFrame({"review": X_valid, "label": y_valid})

In [136]:
print(train.shape)
print(test.shape)
print(valid.shape)

(1296000, 2)
(160000, 2)
(144000, 2)


In [137]:
train.to_csv(r"C:\Users\wenceslas\Documents\cours\ENSAE\2A\Normal\statapp\nlp_understanding\data\01_raw\test_twitter\train_data.csv"
                 , encoding= "latin-1", index= False)
test.to_csv(r"C:\Users\wenceslas\Documents\cours\ENSAE\2A\Normal\statapp\nlp_understanding\data\01_raw\test_twitter\test_data.csv"
                 , encoding= "latin-1", index= False)
valid.to_csv(r"C:\Users\wenceslas\Documents\cours\ENSAE\2A\Normal\statapp\nlp_understanding\data\01_raw\test_twitter\valid_data.csv"
                 , encoding= "latin-1", index= False)
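
Since the files are written as latin-1, they have to be read back with the same encoding. Writing should not fail here, because the text was itself decoded from latin-1, but any character outside that range would raise an error. The sketch below illustrates both points with the standard `csv` module rather than pandas; the file name and sample rows are made up.

```python
import csv
import os
import tempfile

rows = [{"review": "caf\u00e9 time", "label": "1"},
        {"review": "so tired", "label": "0"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")  # hypothetical file name

    # Write with latin-1, mirroring the encoding passed to to_csv above.
    with open(path, "w", encoding="latin-1", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["review", "label"])
        writer.writeheader()
        writer.writerows(rows)

    # Read back with the same encoding; utf-8 would fail on the "é" byte.
    with open(path, encoding="latin-1", newline="") as f:
        reloaded = list(csv.DictReader(f))

print(reloaded == rows)

# Characters outside latin-1 (here the euro sign) cannot be written with this encoding.
try:
    "\u20ac".encode("latin-1")
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
print(euro_ok)
```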