## Exploratory analysis of the tweet sentiment expression dataset.

### Introduction


The main objective of our project is to develop an intelligent system capable of analyzing and understanding the sentiment of a message posted by a user on a social network, here [Twitter](https://twitter.com/home?lang=fr). Using natural language processing (NLP) techniques, our solution will extract key information, establish semantic connections and build a representation of user sentiment.

## Cleaning data

### Import

We will use a standard data science stack: `numpy`, `pandas`, `sklearn`, `matplotlib` and `seaborn`.

In [1]:
# data manipulation
import numpy as np
import pandas as pd

# graphic representation
import matplotlib.pyplot as plt
import seaborn as sns

# file system management
import os

The data used in this study is stored under the relative path `../Data`. That folder contains a CSV file with a list of user tweets. We also save our intermediate DataFrames there so that later steps of the study can reuse them.

In [2]:
print(os.listdir("../Data"))

['.DS_Store', 'tweets.csv']


We load the tweet file into a DataFrame. The file has no header row, so we name its six columns ourselves.

In [3]:
# the raw file has no header row, so the column names are supplied at load time
data = pd.read_csv('../Data/tweets.csv', encoding='latin-1', header=None,
                   names=['target', 'id', 'date', 'flag', 'user', 'text'])
data

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


The file contains 1,600,000 user tweets extracted using Twitter's API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect the sentiment of a tweet.

The DataFrame has six columns (the values in parentheses are examples):

1. `target`: the sentiment of the tweet (0 = negative, 4 = positive)
2. `id`: the tweet id (2087)
3. `date`: the tweet date (Sat May 16 23:58:44 UTC 2009)
4. `flag`: the query (LyX). If there is no query, the value is NO_QUERY
5. `user`: the user who tweeted (robotickilldozr)
6. `text`: the text of the tweet (with LyX)

In [4]:
with pd.option_context('display.max_colwidth', None):
    print(data.iloc[0, :])

target                                                                                                                      0
id                                                                                                                 1467810369
date                                                                                             Mon Apr 06 22:19:45 PDT 2009
flag                                                                                                                 NO_QUERY
user                                                                                                          _TheSpecialOne_
text      @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D
Name: 0, dtype: object


We simplify the text by removing, in this order, user mentions, web links, punctuation, single-character words and digits. We then collapse repeated whitespace into a single space and convert everything to lowercase.

In [5]:
import re

# each step runs on the output of the previous one, so the order matters
data['text'] = data['text'].apply(lambda x: re.sub(r'\S*@\S*\s?', '', x))  # tokens containing "@" (user mentions)
data['text'] = data['text'].apply(lambda x: re.sub(r'http\S+', '', x))  # web links
data['text'] = data['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))  # punctuation
data['text'] = data['text'].apply(lambda x: re.sub(r'\b\w\b', '', x))  # single-character words
data['text'] = data['text'].apply(lambda x: re.sub(r'\d', '', x))  # digits
data['text'] = data['text'].apply(lambda x: re.sub(r'\s+', ' ', x))  # repeated whitespace
data['text'] = data['text'].apply(lambda x: x.lower())  # lowercase


In [7]:
with pd.option_context('display.max_colwidth', None):
    print(data.iloc[0, :])

target                                                                       0
id                                                                  1467810369
date                                              Mon Apr 06 22:19:45 PDT 2009
flag                                                                  NO_QUERY
user                                                           _TheSpecialOne_
text       awww thats bummer you shoulda got david carr of third day to do it 
Name: 0, dtype: object
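The cleaning steps can also be gathered into a single function, which makes them easy to try on one tweet before running them on the whole column. The following is a minimal sketch using the same regular expressions in the same order; the names `CLEANING_STEPS` and `clean_tweet` are illustrative and do not appear in the notebook.

```python
import re

# (pattern, replacement) pairs, applied in this order
CLEANING_STEPS = [
    (r"\S*@\S*\s?", ""),  # tokens containing "@" (user mentions)
    (r"http\S+", ""),     # web links
    (r"[^\w\s]", ""),     # punctuation
    (r"\b\w\b", ""),      # single-character words
    (r"\d", ""),          # digits
    (r"\s+", " "),        # runs of whitespace
]


def clean_tweet(text):
    """Apply the cleaning steps in order, then lowercase the result."""
    for pattern, replacement in CLEANING_STEPS:
        text = re.sub(pattern, replacement, text)
    return text.lower()


raw = ("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  "
       "You shoulda got David Carr of Third Day to do it. ;D")
cleaned = clean_tweet(raw)
print(cleaned.strip())
```

With such a helper, the whole column could be cleaned with `data['text'].apply(clean_tweet)`.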


The `nltk` library lets us remove stopwords: very frequent words of a language (such as "you", "of" or "to") that add little information for understanding a text.

In [8]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gaeldelescluse/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [9]:
# Tokenize every tweet and assign the whole column in one step.
# Writing cell by cell with data['text_token'][i] = ... is chained assignment:
# pandas warns that the value may be set on a copy, and under Copy-on-Write
# (the default behaviour from pandas 3.0) it never updates the original DataFrame.
# See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['text_token'] = data['text'].apply(nltk.word_tokenize)
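Setting values with `df["col"][row] = value` is chained assignment, which pandas discourages. A minimal sketch of two single-step alternatives on a toy DataFrame follows; the frame `toy` is made up for illustration, and `str.split` stands in for `nltk.word_tokenize` so that the example needs no NLTK data.

```python
import pandas as pd

# toy data, not taken from the tweet file
toy = pd.DataFrame({"text": ["woke up angry", "beautiful song"]})

# Option 1: build the whole column in one step with apply
toy["text_token"] = toy["text"].apply(str.split)

# Option 2: update a single cell in one step with .loc[row, column]
toy.loc[0, "text"] = toy.loc[0, "text"].upper()

print(toy)
```

Option 1 is usually the better fit when every row is transformed in the same way, as it avoids an explicit Python loop over the index.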

In [10]:
# build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

def remove_stopwords(word_list):
    # keep only the tokens that are not English stopwords
    return [word for word in word_list if word not in stop_words]

In [11]:
# filter the whole column in one step (avoids chained assignment)
data['text_token'] = data['text_token'].apply(remove_stopwords)

In [12]:
with pd.option_context('display.max_colwidth', None):
    print(data.iloc[0, :])

target                                                                           0
id                                                                      1467810369
date                                                  Mon Apr 06 22:19:45 PDT 2009
flag                                                                      NO_QUERY
user                                                               _TheSpecialOne_
text           awww thats bummer you shoulda got david carr of third day to do it 
text_token            [awww, thats, bummer, shoulda, got, david, carr, third, day]
Name: 0, dtype: object
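The effect of stopword filtering on the first tweet can be reproduced without downloading any NLTK data. This sketch assumes a small hand-picked stopword set (`STOP_WORDS` is hypothetical; NLTK's English list is much longer) and uses a set lookup so that each membership test is cheap.

```python
# Hypothetical mini stopword list, for illustration only
STOP_WORDS = frozenset({"you", "of", "to", "do", "it", "the", "a"})


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in stop_words, preserving their order."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = "awww thats bummer you shoulda got david carr of third day to do it".split()
filtered = remove_stopwords(tokens)
print(filtered)
```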


In [13]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt')  # sentence segmentation
nltk.download('averaged_perceptron_tagger')  # part-of-speech tags
nltk.download('wordnet')  # synonyms (WordNet)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gaeldelescluse/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/gaeldelescluse/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gaeldelescluse/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Lemmatization simplifies sentences by reducing each word to its dictionary form (its lemma). It removes verb conjugations and plurals: for example, "got" becomes "get".

In [14]:
def get_pos(word):
    # tag the word on its own and keep the first letter of its part-of-speech tag
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV, "J": wordnet.ADJ}
    # any other tag falls back to noun
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_words(word_list):
    lemmatizer = WordNetLemmatizer()
    # lemmatize each token according to its part of speech
    return [lemmatizer.lemmatize(word, get_pos(word)) for word in word_list]
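`get_pos` keeps only the first letter of the tag returned by `nltk.pos_tag` and maps it to a WordNet part of speech. The mapping itself can be sketched with plain strings, since NLTK's WordNet constants are one-letter strings; `TAG_TO_WORDNET` and `penn_to_wordnet` are illustrative names, not part of the notebook or of NLTK.

```python
# In NLTK: wordnet.NOUN == "n", wordnet.VERB == "v",
#          wordnet.ADV == "r",  wordnet.ADJ == "a"
TAG_TO_WORDNET = {"N": "n", "V": "v", "R": "r", "J": "a"}


def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. "VBD") to a WordNet POS, defaulting to noun."""
    return TAG_TO_WORDNET.get(penn_tag[0].upper(), "n")


# plural noun, past-tense verb, adverb, adjective, preposition
examples = {tag: penn_to_wordnet(tag) for tag in ["NNS", "VBD", "RB", "JJ", "IN"]}
print(examples)
```

Tags outside the four families, such as the preposition tag `IN`, fall back to noun, which matches the default used in `get_pos`.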

In [15]:
# lemmatize the whole column in one step (avoids chained assignment)
data['text_token'] = data['text_token'].apply(lemmatize_words)

In [16]:
word_dataset = '../Data/normalized_dataset.csv'
data.to_csv(word_dataset, index=False)

In [17]:
df = pd.read_csv("../Data/normalized_dataset.csv")

In [18]:
import ast

# to_csv stored each token list as a string such as "['awww', 'thats']";
# parse it back into a list, then join the tokens into one space-separated string
df['words'] = df['text_token'].apply(ast.literal_eval).apply(' '.join)
df = df.drop(columns=['text_token'])
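Writing a column of Python lists to CSV stores each list as its string representation, which is why the tokens have to be parsed again after reloading. The sketch below reproduces that round trip in memory on made-up data, using `io.StringIO` in place of a file on disk.

```python
import ast
import io

import pandas as pd

# toy data, not taken from the tweet file
original = pd.DataFrame({"text_token": [["woke", "angry"], ["beautiful", "song"]]})

# write to an in-memory buffer and read it back, as with a CSV file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# after the round trip each cell is a string, not a list
first_cell = reloaded.loc[0, "text_token"]

# ast.literal_eval parses the string back into a list without executing code
reloaded["text_token"] = reloaded["text_token"].apply(ast.literal_eval)
reloaded["words"] = reloaded["text_token"].apply(" ".join)
print(reloaded)
```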

In [19]:
with pd.option_context('display.max_colwidth', None):
    print(df.iloc[0, :])

target                                                                       0
id                                                                  1467810369
date                                              Mon Apr 06 22:19:45 PDT 2009
flag                                                                  NO_QUERY
user                                                           _TheSpecialOne_
text       awww thats bummer you shoulda got david carr of third day to do it 
words                       awww thats bummer shoulda get david carr third day
Name: 0, dtype: object


In [20]:
word_dataset = '../Data/cleaned_dataset.csv'
df.to_csv(word_dataset, index=False)

In [21]:
df = pd.read_csv("../Data/cleaned_dataset.csv")

In [22]:
# draw a balanced sample: 500,000 negative and 500,000 positive tweets
df_neg = df[df['target'] == 0].sample(500000)
df_pos = df[df['target'] == 4].sample(500000)
# relabel the positive class from 4 to 1
df_pos['target'] = 1
# concatenate the two classes, then shuffle the rows
df_sample = pd.concat([df_neg, df_pos], ignore_index=True)
df_sample = df_sample.sample(frac=1).reset_index(drop=True)
df_sample

Unnamed: 0,target,id,date,flag,user,text,words
0,0,1967456238,Fri May 29 19:36:49 PDT 2009,NO_QUERY,ozpancakes,oh well than thats understandable specially si...,oh well thats understandable specially since f...
1,1,1974019720,Sat May 30 12:16:52 PDT 2009,NO_QUERY,IslandBookworm,have friend who thinks it should be civic dut...,friend think civic duty leave one wifi open
2,0,1991907659,Mon Jun 01 07:52:24 PDT 2009,NO_QUERY,discostickOx,can anybody tell me how to upload picture on t...,anybody tell upload picture thiss say mine big
3,1,1751642499,Sat May 09 19:55:53 PDT 2009,NO_QUERY,sharonhayes,beautiful song for anyone that could use pick...,beautiful song anyone could use pick tonight
4,1,2066590713,Sun Jun 07 10:48:27 PDT 2009,NO_QUERY,Orli,exactly ive told them it reminds me the house ...,exactly ive told reminds house hansel gretel
...,...,...,...,...,...,...,...
999995,0,1678810323,Sat May 02 07:46:58 PDT 2009,NO_QUERY,dorothysiok,followed the saga when was supposed to be mugg...,follow saga suppose mug hard guilty
999996,0,1980260262,Sun May 31 06:12:43 PDT 2009,NO_QUERY,ichbinkatie,woke up angry,woke angry
999997,1,1833791018,Mon May 18 00:28:24 PDT 2009,NO_QUERY,popitlockit,ranch all the way or italian either or,ranch way italian either
999998,1,1793829089,Thu May 14 04:07:38 PDT 2009,NO_QUERY,canaaa,editing my fs and multiply,edit f multiply
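The sampling step draws the same number of tweets from each class, relabels the positive class and shuffles the result. Here is a small sketch of that pattern on made-up data; the `random_state` values are an addition for reproducibility and are not used in the notebook.

```python
import pandas as pd

# toy data: 6 negative (0) and 10 positive (4) rows
toy = pd.DataFrame({
    "target": [0] * 6 + [4] * 10,
    "words": [f"tweet {i}" for i in range(16)],
})

n_per_class = 4

# same number of rows from each class; random_state makes the draw repeatable
neg = toy[toy["target"] == 0].sample(n_per_class, random_state=42)
pos = toy[toy["target"] == 4].sample(n_per_class, random_state=42)

# relabel the positive class from 4 to 1
pos = pos.assign(target=1)

# concatenate, shuffle and rebuild a clean 0..n-1 index
balanced = pd.concat([neg, pos], ignore_index=True)
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)
print(balanced["target"].value_counts())
```

Fixing the seed means the saved sample can be regenerated exactly, which helps when later steps of the study are compared across runs.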


In [23]:
sample_df = '../Data/sample_dataset.csv'
df_sample.to_csv(sample_df, index=False)