## What is Sentiment Analysis?

Sentiment analysis is the process of extracting opinions and determining their polarity: positive, negative or neutral. It is also known as opinion mining or polarity detection. With sentiment analysis you can find out the nature of the opinions expressed in documents, websites, social media feeds, etc. It is a classification task: each piece of text is assigned to a class. The classes can be binary (positive or negative) or there can be several of them (happy, sad, angry, etc.).
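
To make the "classification" framing concrete, here is a minimal sketch of a binary text classifier. The sentences and labels are made up for illustration, and the bag-of-words plus Naive Bayes pairing is only one possible choice, not necessarily the model used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up sentences and labels, for illustration only
train_texts = [
    "I love this movie",
    "great wonderful film",
    "I hate this movie",
    "terrible awful film",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Turn each sentence into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Fit a Naive Bayes classifier on the counts
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Classify unseen text with the same vocabulary
new_texts = ["wonderful great", "awful terrible"]
predictions = clf.predict(vectorizer.transform(new_texts))
print(list(predictions))
```

The same two steps (vectorize the text, then fit a classifier) apply whether there are two classes or many.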

![](https://d1sjtleuqoc1be.cloudfront.net/wp-content/uploads/2019/04/25112909/shutterstock_1073953772.jpg)

## What files do I need?

You'll need train.csv, test.csv and sample_submission.csv.

## What should I expect the data format to be?

Each sample in the train and test set has the following information:

* The text of a tweet
* A keyword from that tweet (although this may be blank!)
* The location the tweet was sent from (may also be blank)

## What am I predicting?

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

### Files
* train.csv - the training set
* test.csv - the test set
* sample_submission.csv - a sample submission file in the correct format

### Columns

* id - a unique identifier for each tweet
* text - the text of the tweet
* location - the location the tweet was sent from (may be blank)
* keyword - a particular keyword from the tweet (may be blank)
* target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)



# 1) Import libraries & packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import random
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB,CategoricalNB
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import re
from nltk.corpus import stopwords
import string
from sklearn import preprocessing
from sklearn.manifold import TSNE
import seaborn as sns
from nltk.stem.porter import PorterStemmer
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn import svm
from nltk.tokenize import word_tokenize
from sklearn.metrics import accuracy_score
from time import time
from sklearn.model_selection import StratifiedKFold

In [None]:
#default theme
sns.set(context='notebook', style='darkgrid', palette='colorblind', font='sans-serif', font_scale=1, rc=None)
matplotlib.rcParams['figure.figsize'] =[8,8]
matplotlib.rcParams.update({'font.size': 15})
matplotlib.rcParams['font.family'] = 'sans-serif'

# 2) Load data & analysis

In [None]:
train = pd.read_csv('../input/nlp-getting-started/train.csv')
test = pd.read_csv('../input/nlp-getting-started/test.csv')
sub = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')

In [None]:
print(train.shape,test.shape)

In [None]:
train.head()

In [None]:
train.describe(include='all')

### Finding missing values

In [None]:
missing_values=train.isnull().sum()
percent_missing = train.isnull().sum()/train.shape[0]*100

value = {
    'missing_values ':missing_values,
    'percent_missing %':percent_missing
}
frame=pd.DataFrame(value)
frame

In [None]:
#Remove redundant samples
train=train.drop_duplicates(subset=['text', 'target'], keep='first')
train.shape

We have 92 redundant samples (same text and same target) in our dataset; the cell above keeps only the first copy of each.
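
A small sketch of what `drop_duplicates(subset=['text', 'target'])` does and does not remove. The toy rows are hypothetical. Note that rows with the same text but different targets are both kept, so it may be worth checking for such conflicting labels separately.

```python
import pandas as pd

# Hypothetical rows, not taken from the competition data
toy = pd.DataFrame({
    "text": ["fire downtown", "fire downtown", "fire downtown", "nice day"],
    "target": [1, 1, 0, 0],
})

# Rows are duplicates only if BOTH text and target match
deduped = toy.drop_duplicates(subset=["text", "target"], keep="first")
removed = len(toy) - len(deduped)
print("rows removed:", removed)

# Texts that still appear with more than one distinct label
labels_per_text = deduped.groupby("text")["target"].nunique()
conflicting_texts = labels_per_text[labels_per_text > 1].index.tolist()
print("conflicting texts:", conflicting_texts)
```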

In [None]:
train.target.value_counts()

In [None]:
fig = plt.figure(figsize=(8,6))
train.groupby('target').id.count().plot.pie(explode=[0.1,0.1],autopct='%1.1f%%',shadow=True)
plt.show()

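The labels are not balanced, which is why the folds used later are stratified so that each split keeps the same class distribution.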
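
The imbalance can also be read off as proportions rather than from the pie chart. A minimal sketch on a made-up label vector (the values are illustrative, not the real class counts):

```python
import pandas as pd

# Made-up labels: three negatives, two positives
target = pd.Series([0, 0, 0, 1, 1], name="target")

# normalize=True returns the share of each class instead of raw counts
proportions = target.value_counts(normalize=True)
print(proportions)
```

On the real data the equivalent call would be `train.target.value_counts(normalize=True)`.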

In [None]:
# Number of words for each sample in train & test data
train['text_length'] = train.text.apply(lambda x: len(x.split()))
test['text_length'] = test.text.apply(lambda x: len(x.split()))


In [None]:
train['text_length'].describe()

In [None]:
test['text_length'].describe()

Across train and test, the longest tweet has 31 words and the shortest has just 1!

In [None]:
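def plot_word_count(df, data_name):
  # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
  sns.histplot(df['text_length'], kde=True, stat='density')
  plt.title(f'Words per tweet: {data_name}')
  plt.grid(True)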

In [None]:
#ig = plt.figure(figsize=(16,6))
#plt.hist(train["text_length"], bins = 30)
#plt.show()
plt.subplot(1, 2, 1)
plot_word_count(train, 'Train')

plt.subplot(1, 2, 2)
plot_word_count(test, 'Test')

plt.subplots_adjust(right=3.0)
plt.show()

In [None]:
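# collecting all words in a single list
# note: the tweets are concatenated without a separator, so the last word of one
# tweet fuses with the first word of the next; ' '.join(train.text) would avoid that
allWords = ''.join(train.text).split()
vocabulary = set(allWords)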

In [None]:
len(vocabulary)

We have 31480 different words (whitespace-separated tokens) in our training data.

In [None]:
def create_corpus(df,target):
    corpus=[]
    
    for x in df[df['target']==target]['text'].str.split():
        for i in x:
            corpus.append(i)
    return corpus

In [None]:
#most frequent 20 words when label == 0 
import collections
allWords=create_corpus(train,target=0)
vocabulary= set(allWords)
vocabulary_list= list(vocabulary)

plt.figure(figsize=(16,5))
counter=collections.Counter(allWords)
most=counter.most_common()
x=[]
y=[]
for word,count in most[:20]:
  x.append(word)
  y.append(count)
sns.barplot(x=y,y=x)

In [None]:
#most frequent 20 words when label == 1 
import collections
allWords=create_corpus(train,target=1)
vocabulary= set(allWords)
vocabulary_list= list(vocabulary)

plt.figure(figsize=(16,5))
counter=collections.Counter(allWords)
most=counter.most_common()
x=[]
y=[]
for word,count in most[:20]:
  x.append(word)
  y.append(count)
sns.barplot(x=y,y=x)

# 3) Data Cleaning

### Removing Punctuations

In [None]:
# Punctuation characters that we will remove from our corpus
string.punctuation

In [None]:
#for  example
text='hi !! whats up bro :) i hope you enjoy with me '
"".join([char for char in text if char not in string.punctuation])

### Removing Numbers

In [None]:
#for example 
text='hey 4 look 333 at me0 58999632'
re.sub('[0-9]', '', text)

### Removing Stopwords

In [None]:
#list of stopwords
stopwords.words('english')

In [None]:
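#for example
text='hey this is me and I am here to help you  '
stop_words = set(stopwords.words('english'))  # build the set once instead of once per token
tokens = word_tokenize(text)
# the stopword list is lowercase, so compare the lowercased token (otherwise 'I' is kept)
tokens=[word for word in tokens if word.lower() not in stop_words]
' '.join(tokens)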

### Now let's build a function that cleans our data

Besides removing numbers and punctuation, the function lowercases all words and applies stemming. Stopword removal is left commented out.

In [None]:
pstem = PorterStemmer()
def clean_text(text):
    text= text.lower()
    text= re.sub('[0-9]', '', text)
    text  = "".join([char for char in text if char not in string.punctuation])
    tokens = word_tokenize(text)
    tokens=[pstem.stem(word) for word in tokens]
    #tokens=[word for word in tokens if word not in stopwords.words('english')]
    text = ' '.join(tokens)
    return text

In [None]:
clean_text("hey I am here # ! looks 4 GOOD can't see you!")

In [None]:
train["clean"]=train["text"].apply(clean_text)
test["clean"]=test["text"].apply(clean_text)

In [None]:
#Let's see the effect of cleaning
train[["text","clean"]].head()

In [None]:
# collecting all words in a single list
# note: as before, the texts are concatenated without a separator, so words at
# tweet boundaries fuse; ' '.join(train.clean) would avoid that
allWords = ''.join(train.clean).split()
vocabulary = set(allWords)
len(vocabulary)

Cleaning reduced the vocabulary from 31480 unique words to 19920.
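
Why does cleaning shrink the vocabulary? Variants such as `Fire!`, `FIRE` and `fire...` collapse into one token. The sketch below shows this on three made-up sentences using only lowercasing plus punctuation and digit removal. It deliberately leaves out stemming, which would merge even more forms, and `simple_clean` is a hypothetical helper, not the notebook's `clean_text`.

```python
import string

# Made-up sentences, for illustration only
raw = ["Fire! Fire in the city", "The city is on FIRE", "fire... in 2 cities"]

def simple_clean(text):
    """Lowercase the text and drop punctuation and digits (no stemming)."""
    text = text.lower()
    return "".join(
        ch for ch in text
        if ch not in string.punctuation and not ch.isdigit()
    )

# Vocabulary = set of distinct whitespace-separated tokens
vocab_raw = set(" ".join(raw).split())
vocab_clean = set(" ".join(simple_clean(t) for t in raw).split())

print(len(vocab_raw), "->", len(vocab_clean))
```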

In [None]:
tfidf = TfidfVectorizer(sublinear_tf=True,max_features=60000, min_df=1, norm='l2',  ngram_range=(1,2))
features = tfidf.fit_transform(train.clean).toarray()
features.shape

In [None]:
features_test = tfidf.transform(test.clean).toarray()

# 4) Machine learning algorithm

In [None]:
#split data into 4 parts with same distribution of classes.
skf = StratifiedKFold(n_splits=4, random_state=48, shuffle=True)
accuracy=[] # list contains the accuracy for each fold
n=1
y=train['target']

In [None]:
for trn_idx, test_idx in skf.split(features, y):
  start_time = time()
  X_tr,X_val=features[trn_idx],features[test_idx]
  y_tr,y_val=y.iloc[trn_idx],y.iloc[test_idx]
  model= LogisticRegression(max_iter=1000,C=3)
  #model=MultinomialNB(alpha=0.5)
  #model=svm.SVC(max_iter=1000)
  model.fit(X_tr,y_tr)
  s = model.predict(X_val)
  sub[str(n)]= model.predict(features_test) 
  
  accuracy.append(accuracy_score(y_val, s))
  print((time() - start_time)/60,accuracy[n-1])
  n+=1

In [None]:
accuracy

In [None]:
np.mean(accuracy)*100
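
The point of `StratifiedKFold` is that every validation fold keeps the class ratio of the full dataset. A minimal sketch on made-up labels (12 negatives and 8 positives, chosen so that they divide evenly into four folds):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 60% class 0, 40% class 1
y = np.array([0] * 12 + [1] * 8)
X = np.zeros((len(y), 1))  # the features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=48)

# Share of positives in each validation fold
val_positive_share = []
for trn_idx, val_idx in skf.split(X, y):
    val_positive_share.append(y[val_idx].mean())

print(val_positive_share)
```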

# 5) Evaluating Model on Validation Set

In [None]:
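from sklearn.metrics import confusion_matrix, classification_report
# model, X_val and y_val are the ones left over from the last fold of the loop above,
# so this report describes that single fold only
pred_valid_y = model.predict(X_val)
print(classification_report(y_val, pred_valid_y ))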

In [None]:
print(confusion_matrix(y_val, pred_valid_y ))
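
For reading the confusion matrix: scikit-learn puts true labels on the rows and predicted labels on the columns, so for binary labels `ravel()` yields `tn, fp, fn, tp`. The sketch below derives precision and recall for class 1 by hand from made-up labels and predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the tweets predicted as disasters, how many really are
recall = tp / (tp + fn)     # of the real disaster tweets, how many were found

print(tn, fp, fn, tp)
print(precision, recall)
```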

# Submission

In [None]:
sub.head(10)

In [None]:
# select the most frequent predicted class across the four fold models
# (on a 2-2 tie mode() returns both classes in sorted order, so column 0 resolves the tie to 0)
df=sub[['1','2','3','4']].mode(axis=1)
sub['target']=df[0].astype(int)
sub=sub[['id','target']]
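
A small sketch of how the row-wise majority vote behaves, including the tie case that an even number of folds makes possible. The predictions below are made up.

```python
import pandas as pd

# Made-up predictions from four fold models for three tweets
fold_preds = pd.DataFrame({
    "1": [1, 0, 1],
    "2": [1, 0, 0],
    "3": [1, 1, 0],
    "4": [0, 0, 1],
})

# mode(axis=1) returns one column per mode; ties produce a second column
modes = fold_preds.mode(axis=1)

# Column 0 holds the smallest mode, so a 2-2 tie is resolved to class 0
vote = modes[0].astype(int)

print(modes)
print(vote.tolist())
```

If that tie-breaking is not what you want, one option is to use an odd number of folds, or to average `predict_proba` outputs instead of voting on hard labels.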

In [None]:
sub.to_csv('submission.csv',index=False)