# Sentiment analysis
1. Task type: NLP
2. Dataset: Tweets (Text type)
3. Use cases: social media management, review systems, news analysis for stock markets

In [48]:
import pandas as pd
import numpy as np

In [49]:
df = pd.read_csv('./data/train.csv', encoding='latin-1', names=['Target', 'TweetID', 'Date', 'No_Query', 'UserName', 'Data'])
df.head()

,Target,TweetID,Date,No_Query,UserName,Data
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


Observations:
1. The file is not UTF-8 encoded, so the correct encoding (latin-1) must be passed to read the CSV file.
2. The file has no header row, so the column names are supplied explicitly.
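
The two observations can be reproduced on a small in-memory sample. This is only a sketch: the two rows, the user names, and the tweet text below are made up for illustration, and only the six-column layout and the `latin-1` encoding come from the cell above.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same six-column layout as train.csv.
# The second tweet contains "é" stored as the single latin-1 byte 0xE9,
# which is not valid UTF-8 on its own.
raw = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so sad"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","caf\u00e9 time"\n'
).encode("latin-1")

columns = ["Target", "TweetID", "Date", "No_Query", "UserName", "Data"]

# With the default UTF-8 decoding the read fails on the 0xE9 byte.
try:
    pd.read_csv(io.BytesIO(raw), names=columns)
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing names= without header= makes pandas treat the first row as data.
sample = pd.read_csv(io.BytesIO(raw), encoding="latin-1", names=columns)
print(utf8_ok)
print(sample[["Target", "Data"]])
```

Because `names=` is given and `header=` is not, the first line of the file is kept as a data row rather than consumed as a header.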

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   Target    1600000 non-null  int64 
 1   TweetID   1600000 non-null  int64 
 2   Date      1600000 non-null  object
 3   No_Query  1600000 non-null  object
 4   UserName  1600000 non-null  object
 5   Data      1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


Only the Target and Data columns are useful for training a sentiment model, so the other columns can be dropped.

In [51]:
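# Take an explicit copy so later assignments modify this frame, not a slice of the original.
df = df[['Target', 'Data']].copy()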
df.head()

,Target,Data
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


Next, I remove tokens that usually do not contribute to sentiment: mentions (@username) and URLs. I keep hashtags for now to check whether they have any effect on the results.

In [52]:
df['Data'] = df['Data'].replace(r'http\S+', '', regex=True).replace(r'@\S+', '', regex=True)
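
To see what these two patterns do, they can be tried on a few sample strings. The sketch below uses `.str.replace` rather than `Series.replace`, and the whitespace collapsing and `strip()` at the end are additions that are not in the notebook. The second and third sample strings are made up.

```python
import pandas as pd

tweets = pd.Series([
    "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.",
    "loving the new album #music @friend",
    "email me at someone@example.com",
])

cleaned = (
    tweets
    .str.replace(r"http\S+", "", regex=True)  # drop URLs
    .str.replace(r"@\S+", "", regex=True)     # drop @mentions
    .str.replace(r"\s+", " ", regex=True)     # collapse leftover whitespace (added step)
    .str.strip()                              # trim the ends (added step)
)
print(cleaned.tolist())
```

Hashtags survive, as intended. Note that `@\S+` also clips the domain part of an email address, because it matches any `@` followed by non-space characters.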

In [53]:
df.head(20)

,Target,Data
0,0,"- Awww, that's a bummer. You shoulda got Da..."
1,0,is upset that he can't update his Facebook by ...
2,0,I dived many times for the ball. Managed to s...
3,0,my whole body feels itchy and like its on fire
4,0,"no, it's not behaving at all. i'm mad. why am..."
5,0,not the whole crew
6,0,Need a hug
7,0,"hey long time no see! Yes.. Rains a bit ,onl..."
8,0,nope they didn't have it
9,0,que me muera ?


The `df.info()` output showed that the data has no null values and that the columns are int64 or object. Next, I need to determine the language of each tweet, because I want the model to be trained on English text only (row 9 above, for example, is not English).
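
The idea can be sketched independently of any particular detector. In the sketch below, `label_english` and `toy_detect` are hypothetical names introduced for illustration. `toy_detect` is a stand-in rule so the example runs without `langdetect` installed, and it is not how a real detector works.

```python
import numpy as np
import pandas as pd

def label_english(texts, detect_fn):
    """Return a Series holding 'en' where detect_fn reports English, NaN elsewhere."""
    def safe_detect(text):
        try:
            return detect_fn(text)
        except Exception:
            # Treat any detector failure (e.g. no usable characters) as unknown.
            return np.nan
    codes = texts.apply(safe_detect)
    return codes.where(codes == "en")

# Stand-in detector (hypothetical rule, for illustration only).
def toy_detect(text):
    if not text.strip():
        raise ValueError("no features in text")
    return "es" if "que" in text.split() else "en"

sample = pd.Series(["Need a hug", "que me muera ?", "   "])
labels = label_english(sample, toy_detect)
print(labels.tolist())
```

With the real library, `langdetect.detect` would be passed as `detect_fn`. The langdetect README notes that detection is non-deterministic unless `DetectorFactory.seed` is set before the first call, which may matter for reproducibility on short tweets.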

In [54]:
from langdetect import detect

In [55]:
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0

# Mark tweets that are whitespace-only after cleaning as missing.
# A single .loc assignment replaces the row-by-row chained assignment
# (df['Data'][i] = NaN), which raised SettingWithCopyWarning.
df.loc[df['Data'].str.isspace(), 'Data'] = NaN


In [56]:
df = df[df['Data'].notna()]

In [57]:
df = df.reset_index(drop=True)
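
The three steps above (mark blank tweets as missing, drop them, renumber the rows) can be checked on a toy frame. This is a sketch with made-up rows. It uses `str.strip() == ""` instead of `isspace()`, which is an assumption about the intent: `isspace()` returns False for the empty string, so a tweet that consisted only of a URL would otherwise slip through.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Target": [0, 0, 4, 4],
    "Data": ["Need a hug", "   ", "", "great day"],
})

# str.strip() == "" catches both whitespace-only and empty strings.
blank = df_demo["Data"].str.strip() == ""
df_demo.loc[blank, "Data"] = np.nan

# Drop the rows marked as missing and renumber the index from 0.
df_demo = df_demo.dropna(subset=["Data"]).reset_index(drop=True)
print(df_demo)
```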

In [58]:
# for i in range(len(df)):
#     try:
#         detect(df['Data'][i])
#     except:
#         print(i)

In [59]:
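import string
# With regex=False, Series.replace matches whole cell values: this marks tweets that are
# exactly one punctuation character as missing and leaves longer tweets untouched.
for char in string.punctuation:
    df['Data'] = df['Data'].replace(char, NaN, regex=False)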

In [60]:
df.head()

,Target,Data
0,0,"- Awww, that's a bummer. You shoulda got Da..."
1,0,is upset that he can't update his Facebook by ...
2,0,I dived many times for the ball. Managed to s...
3,0,my whole body feels itchy and like its on fire
4,0,"no, it's not behaving at all. i'm mad. why am..."
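
The output above still contains punctuation because `Series.replace` with `regex=False` compares whole cell values. The sketch below contrasts it with `.str.replace`, which works inside each string. Whether stripping all punctuation is desirable here is an open question, since it would also remove the `#` of hashtags and the apostrophes in contractions. The sample strings are made up.

```python
import re
import string
import pandas as pd

s = pd.Series(["?", "que me muera ?", "wow!!!"])

# Series.replace with regex=False only replaces cells that equal the value exactly.
whole_cell = s.replace("?", "", regex=False)

# A character class built from string.punctuation strips punctuation inside each string.
pattern = "[" + re.escape(string.punctuation) + "]"
stripped = s.str.replace(pattern, "", regex=True).str.strip()

print(whole_cell.tolist())
print(stripped.tolist())
```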


In [61]:
df['ln'] = [0]*len(df)
print(df.head())

def english_or_nan(text):
    # Return 'en' for English text; NaN for any other language or when detection fails.
    try:
        return 'en' if detect(text) == 'en' else NaN
    except Exception:
        return NaN

# One column-wide assignment replaces the row-by-row chained assignment
# (df['ln'][i] = ...), which raised SettingWithCopyWarning.
df['ln'] = df['Data'].apply(english_or_nan)

   Target                                               Data  ln
0       0    - Awww, that's a bummer.  You shoulda got Da...   0
1       0  is upset that he can't update his Facebook by ...   0
2       0   I dived many times for the ball. Managed to s...   0
3       0    my whole body feels itchy and like its on fire    0
4       0   no, it's not behaving at all. i'm mad. why am...   0


In [62]:
df.head(20)

,Target,Data,ln
0,0,"- Awww, that's a bummer. You shoulda got Da...",en
1,0,is upset that he can't update his Facebook by ...,en
2,0,I dived many times for the ball. Managed to s...,en
3,0,my whole body feels itchy and like its on fire,en
4,0,"no, it's not behaving at all. i'm mad. why am...",en
5,0,not the whole crew,en
6,0,Need a hug,en
7,0,"hey long time no see! Yes.. Rains a bit ,onl...",en
8,0,nope they didn't have it,en
9,0,que me muera ?,


In [65]:
df.to_csv('./data/trainModified.csv')

In [66]:
df = pd.read_csv('./data/trainModified.csv')
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Target,Data,ln
0,0,0,0,"- Awww, that's a bummer. You shoulda got Da...",en
1,1,1,0,is upset that he can't update his Facebook by ...,en
2,2,2,0,I dived many times for the ball. Managed to s...,en
3,3,3,0,my whole body feels itchy and like its on fire,en
4,4,4,0,"no, it's not behaving at all. i'm mad. why am...",en
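
The extra `Unnamed: 0` columns in the output above come from the saved index: `to_csv` writes the index as an unnamed first column by default, and `read_csv` then loads it as a regular column. Each save and reload cycle adds another one. A minimal sketch on a toy frame (made-up rows, in-memory buffers instead of `./data/trainModified.csv`) shows the difference that `index=False` makes:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({
    "Target": [0, 4],
    "Data": ["Need a hug", "great day"],
    "ln": ["en", "en"],
})

# Default: the index is written as an unnamed first column ...
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# ... while index=False writes only the data columns.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_clean.columns.tolist())
```

Alternatively, a file that was already written with its index can be read back with `index_col=0`.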
