# <span style = "color:green"> Twitter Sentiment Analysis </span>

***

Sentiment analysis identifies and classifies the sentiments expressed in a text source. Tweets are a rich source of sentiment data, which is useful for understanding people's opinions on a wide range of topics.

We therefore want an automated machine learning sentiment analysis model that can gauge customer perception. Raw tweets mix useful text with non-useful characters (collectively called noise), which makes it difficult to train models on them directly.

Here, we aim to analyze the sentiment of the tweets in the dataset by building a machine learning pipeline that combines Term Frequency-Inverse Document Frequency (TF-IDF) features with an SVM classifier.

The dataset consists of 13870 tweets extracted using the Twitter API. It contains various columns, but for this problem we will use only:
   * Sentiment - Positive, Negative, Neutral
   * Text - Tweet

## Let's get Started

### Import Necessary Libraries

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
import re

### Read the dataset

In [None]:
df = pd.read_csv('twitter.csv')
df

### View head

In [None]:
df.head()

### View info of the dataset

In [None]:
df.info()

### Drop all columns except 'text' and 'sentiment'

In [None]:
df = df[['text','sentiment']]

In [None]:
df.head()

### Check all the unique values in Sentiment

In [None]:
df['sentiment'].unique()

### Convert Neutral to 0, Positive to 1 and Negative to -1

In [None]:
def Convert(x):
    if x == 'Neutral':
        return 0
    elif x == 'Positive':
        return 1
    else:
        return -1

In [None]:
df = df.copy()
df['sentiment'] = df['sentiment'].apply(Convert)

In [None]:
df.head()
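
The mapping above sends every label other than 'Neutral' and 'Positive' to -1, so a misspelled or unexpected label would silently count as negative. A minimal sketch of a stricter variant follows; the names `LABELS` and `convert_strict` are hypothetical and not part of the original notebook.

```python
# Hypothetical lookup table for the three labels named in this notebook
LABELS = {'Neutral': 0, 'Positive': 1, 'Negative': -1}

def convert_strict(label):
    """Map a sentiment label to its integer code, rejecting unknown labels."""
    try:
        return LABELS[label]
    except KeyError:
        raise ValueError(f"unexpected sentiment label: {label!r}") from None

codes = [convert_strict(s) for s in ['Positive', 'Neutral', 'Negative']]
print(codes)
```

With pandas, `df['sentiment'].map(LABELS)` gives a similar effect: labels missing from the dictionary become NaN instead of -1, which makes them easy to spot.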

### Check for missing values

In [None]:
df.isna().sum()

### Check for Duplicates

In [None]:
df.duplicated().sum()

### Drop duplicate rows

In [None]:
df.drop_duplicates(keep = 'first',inplace = True)

In [None]:
df.duplicated().sum()
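
One thing to keep in mind after dropping rows: the surviving rows keep their original index labels, so the index can have gaps. A small sketch on made-up data (the `toy` frame is an assumption for illustration only):

```python
import pandas as pd

# Made-up data: row 1 duplicates row 0
toy = pd.DataFrame({'text': ['a', 'a', 'b', 'c'], 'sentiment': [1, 1, 0, -1]})

deduped = toy.drop_duplicates(keep='first')
print(deduped.index.tolist())   # label 1 is missing from the index

# reset_index(drop=True) renumbers the rows from 0 without keeping the old labels
clean = deduped.reset_index(drop=True)

# Positional access with .iloc works regardless of the index labels
first_two = clean['text'].iloc[:2].tolist()
print(first_two)
```

Label-based lookups such as `df['text'][1]` raise a `KeyError` when that label was dropped, so either reset the index or use `.iloc` / `.head()` for positional access.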

### View some of the tweets

In [None]:
# Use head() for positional access: after dropping duplicates the index may have gaps
for tweet in df['text'].head(10):
    print(tweet)

### Exploratory Data Analysis

### Plot a countplot of sentiment

In [None]:
sns.countplot(data = df, x = 'sentiment', palette='bright', hue = 'sentiment')

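### Plot a pie chart to show the percentage share of each sentiment

In [None]:
# value_counts() orders by frequency; sort by label (-1, 0, 1) so counts match the label names
counts = df['sentiment'].value_counts().sort_index()
plt.pie(counts, labels = ['Negative', 'Neutral', 'Positive'], autopct = '%0.2f')
plt.show()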
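
The percentages shown on the pie chart can also be computed directly, which is a quick way to check that each slice carries the right label. A minimal sketch on made-up labels (the `labels` list and `names` mapping are assumptions for illustration):

```python
from collections import Counter

# Made-up encoded sentiments: three negative, two neutral, one positive
labels = [-1, -1, -1, 0, 0, 1]
names = {-1: 'Negative', 0: 'Neutral', 1: 'Positive'}

counts = Counter(labels)
total = sum(counts.values())

# Iterate in sorted label order so each name is paired with its own count
shares = {names[k]: round(100 * counts[k] / total, 2) for k in sorted(counts)}
print(shares)
```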

### Define a function that preprocesses the tweets

That is:
* Remove all special characters
* Remove any stopwords
* Lemmatize the words

In [None]:
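from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Download the NLTK resources used below (only needed once per environment)
nltk.download('stopwords')
nltk.download('wordnet')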

In [None]:
def preprocess(sentence):
    # Replace every run of non-alphanumeric characters with a space, then split on whitespace
    tokens = re.sub(r'[^0-9a-zA-Z]+', ' ', sentence).split()

    # Build the stopword set once per call instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))

    # Lowercase before the stopword check so capitalized stopwords ("The", "And") are removed too
    words = [x.lower() for x in tokens if x.lower() not in stop_words]

    # Lemmatize each word, treating it as a verb ('v')
    lemma = WordNetLemmatizer()
    lemmas = [lemma.lemmatize(word, 'v') for word in words]

    # Join the list of words back into a single string
    return ' '.join(lemmas)
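
To see what the cleaning steps do to a single tweet, here is a dependency-free sketch of the first two steps (special-character removal and stopword filtering). The `STOPWORDS` set and the sample tweet are made up for illustration; NLTK's English stopword list is much longer, and lemmatization is left out here.

```python
import re

# Tiny illustrative stopword set (assumption); not NLTK's actual list
STOPWORDS = {"the", "is", "a", "to", "and"}

def clean(sentence):
    # Replace runs of non-alphanumeric characters with a space, then split
    tokens = re.sub(r'[^0-9a-zA-Z]+', ' ', sentence).split()
    # Compare the lowercased token so "The" is treated the same as "the"
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

result = clean("The debate is ON!! #GOPDebate @user")
print(result)
```

Note that hashtags and mentions lose their `#` and `@` but the words themselves stay, and that the stopword comparison is case-sensitive unless the token is lowercased first.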

### Apply the function to our tweets column

In [None]:
df['text'] = df['text'].apply(preprocess)

### Print some of the tweets after preprocessing

In [None]:
print(df['text'])

In [None]:
# Use head() for positional access: after dropping duplicates the index may have gaps
for tweet in df['text'].head(10):
    print(tweet)

### Assign X and y variables

In [None]:
X = df['text']
y = df['sentiment']

### Transform the X variable (tweets) using TF-IDF Vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vector = TfidfVectorizer()

In [None]:
X = vector.fit_transform(X).toarray()

In [None]:
X.shape
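
To make the TF-IDF weighting concrete, here is a small pure-Python sketch of the formula that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three toy documents are made up, and the whitespace tokenization is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Made-up, already preprocessed documents
docs = [
    "good debate tonight",
    "bad debate",
    "good good answer",
]

def tfidf(docs):
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[w] * idf[w] for w in vocab]
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, matrix = tfidf(docs)
print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

In the first document, "tonight" appears in only one document and therefore gets a higher weight than "debate", which appears in two.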

### Split the data into training and testing set

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
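
If the sentiment classes are imbalanced, it can help to keep their proportions the same in both splits and to fix the random seed so the split is reproducible. A minimal sketch on made-up data, assuming `random_state=42` as an arbitrary seed:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 30 samples, 10 per class
X_toy = [[i] for i in range(30)]
y_toy = [-1] * 10 + [0] * 10 + [1] * 10

# stratify keeps the class proportions; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(Counter(y_te))
```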

### Check the shape of X_train and X_test

In [None]:
X_train.shape

In [None]:
X_test.shape

### Create an SVM Model

In [None]:
from sklearn.svm import SVC

# Instantiate the classifier with default settings so it can be trained below
model = SVC()

### Train the model

In [None]:
model.fit(X_train,y_train)

### Check the score of the training set

In [None]:
model.score(X_train,y_train)
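
As an alternative arrangement, the vectorizer and the classifier can be chained in a scikit-learn pipeline. This is a sketch under assumptions, not the notebook's method: the toy texts and labels are made up. Fitting the pipeline on training text only means the TF-IDF vocabulary and IDF weights are learned without seeing the test tweets, and the sparse matrix is passed to `SVC` directly without `.toarray()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training and test texts (already preprocessed)
train_texts = ["great answer tonight", "love debate", "terrible answer", "hate debate"]
train_labels = [1, 1, -1, -1]
test_texts = ["great debate", "terrible tonight"]

# The pipeline fits TF-IDF on the training texts, then trains the SVM on the result
clf = make_pipeline(TfidfVectorizer(), SVC())
clf.fit(train_texts, train_labels)

# At prediction time the same fitted vocabulary is reused for the new texts
predictions = clf.predict(test_texts)
print(predictions)
```

With this layout the raw text column would be split with `train_test_split` first, and the pipeline fitted on the training portion.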

### Make prediction with X_test

In [None]:
y_pred = model.predict(X_test)

### Check the accuracy of our prediction

In [None]:
from sklearn import metrics

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
metrics.accuracy_score(y_test,y_pred)

### Plot confusion matrix on heatmap

In [None]:
metrics.confusion_matrix(y_test,y_pred)

In [None]:
# The matrix holds integer counts, so format the annotations as integers
sns.heatmap(metrics.confusion_matrix(y_test,y_pred), annot=True, fmt='d')
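
To read the heatmap, it helps to know how the matrix is laid out: in scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, in sorted label order (-1, 0, 1 here). A pure-Python sketch of the same counting on made-up labels:

```python
def confusion(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Made-up true and predicted labels
y_true = [-1, -1, -1, 0, 0, 1]
y_hat = [-1, -1, 0, 0, 1, 1]

cm = confusion(y_true, y_hat, labels=[-1, 0, 1])
for row in cm:
    print(row)
```

The diagonal holds the correct predictions; every off-diagonal cell is a specific kind of mistake, for example a negative tweet predicted as neutral.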

### Print Classification report

In [None]:
print(metrics.classification_report(y_test,y_pred))
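
The per-class numbers in the report are precision, recall and F1. A minimal sketch of how they are derived for a single class, using made-up labels (the helper name `precision_recall_f1` is hypothetical):

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up true and predicted labels
y_true = [-1, -1, -1, 0, 0, 1]
y_hat = [-1, -1, 0, 0, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_hat, label=-1)
print(precision, recall, f1)
```

Here every tweet predicted as negative really is negative (precision 1.0), but one of the three negative tweets was missed (recall 2/3).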

***

# <center><a href = "http://edure.in/"><span style = "color:CornflowerBlue; font-family:Courier New;font-size:40px">EDURE LEARNING</span></a></center>