
# <u><b> Objective </b></u>
## <b> You are given a dataset of US airline tweets and their sentiment. The task is to run sentiment analysis on the problems of each major U.S. airline. The Twitter data was scraped in February 2015; contributors were asked to first classify tweets as positive, negative, or neutral, and then to categorize the reasons for negative tweets (such as "late flight" or "rude service"). </b>

<br>
<br>

## <b>Things to do:</b>
* ### Read the Tweets.csv data, then clean and tokenize the tweets using the nltk library.
* ### Count-vectorize the tweets so that you end up with a sparse matrix (which will be your $X$).
* ### Build an SVM classifier (a binary classifier, in fact). Since the data contains three levels of sentiment (positive, negative and neutral), remove the tweets that are neutral. Once you do that, only two classes remain (positive and negative). You can set the label of positive tweets to 1 and negative tweets to 0.
* ### Once you have built the SVM classifier, evaluate the model across various metrics. Also plot the ROC curve and the Precision-Recall curve. Report the areas under these two curves along with the other metrics.
* ### Perform grid-search cross-validation over various values of $C$ and $\gamma$. These are the hyperparameters to tune.
* ### Explain your observations and the underlying reasons for them.
* ### Check whether <code>TfidfVectorizer</code> gives a lift in the model's performance.
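
The steps listed above can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's actual solution: the example tweets, the pipeline step names (`counts`, `svm`), and the grid values for `C` and `gamma` are all assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy tweets standing in for the non-neutral rows of Tweets.csv.
texts = [
    "great flight and friendly crew",
    "loved the service thank you",
    "amazing crew great service",
    "thank you for the smooth flight",
    "flight delayed again terrible service",
    "lost my bag rude staff",
    "worst airline delayed and rude",
    "terrible delay lost bag",
]
sentiments = ["positive"] * 4 + ["negative"] * 4

# Map positive -> 1 and negative -> 0, as the task suggests.
labels = [1 if s == "positive" else 0 for s in sentiments]

# Count-vectorize the text (sparse matrix) and feed it to an RBF-kernel SVM.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("svm", SVC(kernel="rbf")),
])

# Illustrative grid over the two hyperparameters named in the task.
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
prediction = search.predict(["rude staff and delayed flight"])
print(prediction)
```

Wrapping the vectorizer and the classifier in one `Pipeline` means each cross-validation fold refits the vocabulary on its own training part, so the held-out fold never influences the features.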





In [238]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [239]:
import pandas as pd
import numpy as np


In [240]:
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML/Tweets.csv')

In [None]:
df

In [242]:
airlines_names=df.airline.value_counts().index.to_list()

In [243]:
# Turn each airline name into a Twitter-style handle, e.g. "Virgin America" -> "@VirginAmerica"
airlines_names = ["@" + name.replace(' ', '') for name in airlines_names]


In [244]:
# The airline column says "American", but the handle used in tweets is @AmericanAir
airlines_names.append('@AmericanAir')

In [245]:
airlines_names

['@United',
 '@USAirways',
 '@American',
 '@Southwest',
 '@Delta',
 '@VirginAmerica',
 '@AmericanAir']

In [246]:
df=df.loc[:,['airline_sentiment','text']]

In [247]:
df=df[df['airline_sentiment']!='neutral']

In [248]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [249]:
from nltk.corpus import stopwords

In [250]:
stop = stopwords.words('english')+airlines_names

In [251]:
stop_set = set(stop)  # set lookup is O(1); matching is still case-sensitive
df.text = df.text.apply(lambda x: ' '.join(word for word in x.split() if word not in stop_set))
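
The filter above compares words exactly, so a handle or stop word with different capitalization (for example `@united` versus `@United`, or `The` versus `the`) is kept. A minimal sketch of a case-insensitive variant follows; the helper name `remove_stop_words` and the shortened stop list are hypothetical and only serve the illustration.

```python
# Hypothetical, shortened stop list; the notebook uses the NLTK English
# stop words plus the airline handles.
stop = ["the", "is", "my", "@United", "@AmericanAir"]


def remove_stop_words(text, stop_words):
    """Drop every word whose lowercase form appears in stop_words."""
    lowered = {w.lower() for w in stop_words}
    return " ".join(w for w in text.split() if w.lower() not in lowered)


tweet = "@united The flight is late and my bag is lost"

# Exact matching, as in the cell above: "@united" and "The" slip through.
case_sensitive = " ".join(w for w in tweet.split() if w not in stop)

# Case-insensitive matching removes them.
case_insensitive = remove_stop_words(tweet, stop)

print(case_sensitive)    # @united The flight late and bag lost
print(case_insensitive)  # flight late and bag lost
```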

In [252]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [253]:
vectorizer = TfidfVectorizer(max_df=0.95)
X = vectorizer.fit_transform(df.text)

In [254]:
from sklearn.model_selection import train_test_split
msg_train,msg_test,label_train,label_test = train_test_split(df.text,df.airline_sentiment,test_size=0.2)
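
The split above is random and unseeded, so every run produces a different test set and different scores. Because the two classes are imbalanced, it can also help to stratify. The sketch below shows both options on made-up data; the 80/20 class ratio and `random_state=42` are assumptions for illustration only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 negative and 20 positive tweets.
texts = [f"tweet {i}" for i in range(100)]
labels = ["negative"] * 80 + ["positive"] * 20

# stratify keeps the class ratio identical in both splits;
# random_state makes the split reproducible across runs.
msg_train, msg_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

test_counts = Counter(label_test)
print(test_counts)
```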

In [255]:
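# Fit the vocabulary and IDF weights on the training split only, so nothing leaks from the test tweets
train_vectorized = vectorizer.fit_transform(msg_train)
test_vectorized = vectorizer.transform(msg_test)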

In [256]:
train_array= train_vectorized.toarray()
test_array = test_vectorized.toarray()

In [257]:
from sklearn.naive_bayes import GaussianNB
# Baseline classifier; GaussianNB needs dense arrays, hence the toarray() calls above
spam_detect_model = GaussianNB().fit(train_array,label_train)

In [258]:
train_preds = spam_detect_model.predict(train_array)
test_preds = spam_detect_model.predict(test_array)

In [259]:
from sklearn.metrics import classification_report,confusion_matrix

In [260]:
# Confusion matrix for the test set (rows: true class, columns: predicted class)
print(confusion_matrix(label_test,test_preds))

[[1376  493]
 [ 150  290]]
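
The per-class metrics follow directly from this matrix. As a worked check, assuming the usual scikit-learn layout (rows are true classes, columns are predicted classes, labels sorted as negative then positive):

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[1376, 493],
      [150, 290]]

tn, fp = cm[0]  # true negative tweets: predicted negative / predicted positive
fn, tp = cm[1]  # true positive tweets: predicted negative / predicted positive

precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"positive: precision={precision_pos:.2f} recall={recall_pos:.2f}")
print(f"negative: precision={precision_neg:.2f} recall={recall_neg:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The low precision on the positive class (0.37) comes from the 493 negative tweets predicted as positive, which outnumber the 290 correctly identified positive tweets.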


In [261]:
print(classification_report(label_test,test_preds))

              precision    recall  f1-score   support

    negative       0.90      0.74      0.81      1869
    positive       0.37      0.66      0.47       440

    accuracy                           0.72      2309
   macro avg       0.64      0.70      0.64      2309
weighted avg       0.80      0.72      0.75      2309
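
The task also asks for the areas under the ROC and Precision-Recall curves, which need continuous scores rather than hard predictions. The sketch below shows one way to get them, assuming a linear-kernel `SVC` and a made-up corpus rather than the notebook's model and data; average precision is used here as one common summary of the Precision-Recall curve.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.svm import SVC

# Hypothetical toy data; 1 = positive, 0 = negative.
train_texts = [
    "great flight and friendly crew",
    "loved the service thank you",
    "amazing crew great service",
    "thank you for the smooth flight",
    "flight delayed again terrible service",
    "lost my bag rude staff",
    "worst airline delayed and rude",
    "terrible delay lost bag",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = [
    "great service friendly crew",
    "rude staff lost my bag",
    "thank you smooth flight",
    "terrible delay again",
]
test_labels = [1, 0, 1, 0]

# Fit the vectorizer on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = SVC(kernel="linear").fit(X_train, train_labels)

# Signed distance to the separating hyperplane; larger means "more positive".
scores = model.decision_function(X_test)

roc_auc = roc_auc_score(test_labels, scores)
pr_auc = average_precision_score(test_labels, scores)

# Points for plotting the two curves, e.g. with matplotlib.
fpr, tpr, _ = roc_curve(test_labels, scores)
precision, recall, _ = precision_recall_curve(test_labels, scores)

print(f"ROC AUC = {roc_auc:.3f}, average precision = {pr_auc:.3f}")
```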



In [237]:
# Classification report for the test set (the output below is from an earlier run with a different random split)
print(classification_report(label_test,test_preds))

              precision    recall  f1-score   support

    negative       0.94      0.51      0.66      1843
    positive       0.31      0.88      0.46       466

    accuracy                           0.58      2309
   macro avg       0.63      0.69      0.56      2309
weighted avg       0.82      0.58      0.62      2309

