
# <u><b> Objective </b></u>
## <b> You are given a dataset of US airline tweets and their sentiment. The task is to perform sentiment analysis on the problems of each major U.S. airline. The Twitter data was scraped in February 2015, and contributors were asked to first classify tweets as positive, negative, or neutral, and then to categorize the reasons for negative tweets (such as "late flight" or "rude service"). </b>

<br>
<br>

## <b>Things to do:</b>
* ### Read the tweets.csv data, then clean and tokenize the tweets using the nltk library.
* ### Count-vectorize the tweets so that you end up with a sparse matrix (which will be your $X$).
* ### Build an SVM classifier for binary classification. Since the data contains three levels of sentiment (positive, negative, and neutral), remove the neutral tweets so that only two classes remain (positive and negative). You can set the label of positive tweets to 1 and negative tweets to 0.
* ### Once you have built the SVM classifier, evaluate it across various metrics. Also plot the ROC curve and the precision-recall curve, and report the areas under these two curves along with the other metrics.
* ### Perform grid-search cross-validation over various values of $C$ and $\gamma$. These are the hyperparameters to tune.
* ### Explain your observations and the underlying reasons for them.
* ### Check whether <code>TfidfVectorizer</code> gives a lift in the model's performance.





In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML/Tweets.csv')

In [None]:
df

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada)


In [None]:
airlines_names=df.airline.value_counts().index.to_list()

In [None]:
# Turn each airline name into a Twitter-style handle, e.g. 'Virgin America' -> '@VirginAmerica'
airlines_names = ['@' + name.replace(' ', '') for name in airlines_names]


In [None]:
# Tweets in this dataset address American Airlines as @AmericanAir, so add that handle too
airlines_names.append('@AmericanAir')

In [None]:
airlines_names

['@United',
 '@USAirways',
 '@American',
 '@Southwest',
 '@Delta',
 '@VirginAmerica',
 '@AmericanAir']
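
The handle list above covers only the six airline names plus `@AmericanAir`. As a hedged alternative, the sketch below strips every @mention and every URL with regular expressions instead of relying on a fixed list. The helper name `strip_mentions_and_urls` and both patterns are illustrative assumptions and not part of the original notebook. Note that removing all mentions is a different choice from removing only airline handles.

```python
import re

# Hypothetical patterns for illustration only.
MENTION_RE = re.compile(r"@\w+")        # any @handle
URL_RE = re.compile(r"https?://\S+")    # any http/https link


def strip_mentions_and_urls(text: str) -> str:
    """Remove @handles and URLs, then collapse repeated whitespace."""
    text = MENTION_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


cleaned = strip_mentions_and_urls("@VirginAmerica What @dhepburn said. http://t.co/abc")
print(cleaned)
```

With pandas, such a helper could be applied through `df.text.apply(strip_mentions_and_urls)` before stop-word removal.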

In [None]:
df=df.loc[:,['airline_sentiment','text']]

In [None]:
df=df[df['airline_sentiment']!='neutral']

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

In [None]:
stop = stopwords.words('english')+airlines_names

In [None]:
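# Drop stop words and airline handles. The match is exact and case-sensitive,
# so tokens such as 'The' or '@united' are kept.
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))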
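
The filter above compares each whitespace-separated token to the stop list exactly as written. A minimal sketch of a case-insensitive variant follows. The tiny `stop_words` set and the function name `remove_stop_words` are assumptions made for illustration; the notebook itself uses NLTK's English list plus the airline handles.

```python
# Illustrative stop list (lowercase); stands in for the notebook's `stop` list.
stop_words = {"the", "is", "a", "@united"}


def remove_stop_words(text, stop_words):
    """Drop tokens whose lowercase form appears in stop_words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


tweet = "@United The flight is late"

# Case-sensitive matching keeps '@United' and 'The'.
case_sensitive = " ".join(w for w in tweet.split() if w not in stop_words)

# Case-insensitive matching removes them.
case_insensitive = remove_stop_words(tweet, stop_words)

print(case_sensitive)
print(case_insensitive)
```

Using a `set` instead of a list also makes each membership test constant-time, which helps on larger corpora.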

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
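# TF-IDF features; max_df=0.95 drops terms that occur in more than 95% of the tweets.
# Note: the vectorizer is fitted on the full corpus before the train/test split,
# so vocabulary and IDF statistics from the test tweets leak into the features.
vectorizer = TfidfVectorizer(max_df=0.95)
X = vectorizer.fit_transform(df.text)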
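
The objective asks for count vectorization, while the cell above uses TF-IDF. The sketch below puts the two side by side on a two-sentence toy corpus (an assumption for illustration, not the tweet data) to show that both produce a sparse matrix with one column per vocabulary term and differ only in the weights.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, purely illustrative.
corpus = ["flight delayed again", "great flight great crew"]

# Raw term counts.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(corpus)

# Counts reweighted by inverse document frequency.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)

vocab = sorted(count_vec.vocabulary_)   # column order is alphabetical
print(vocab)
print(X_counts.toarray())
print(X_tfidf.toarray().round(2))
```

Swapping one vectorizer for the other while keeping the rest of the workflow fixed is one way to check whether TF-IDF gives any lift.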

In [None]:
Y=df['airline_sentiment']
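
The objective suggests encoding positive tweets as 1 and negative tweets as 0, whereas the cell above keeps the string labels. A minimal sketch of that encoding on a toy Series follows; the names `label_map` and `y_binary` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df['airline_sentiment'] after neutral tweets are removed.
sentiment = pd.Series(["negative", "positive", "negative"], name="airline_sentiment")

# Map the two remaining classes to integers: positive -> 1, negative -> 0.
label_map = {"negative": 0, "positive": 1}
y_binary = sentiment.map(label_map)

print(y_binary.tolist())
```

scikit-learn classifiers accept string labels as well, so this step is mainly a convenience for metrics that expect a numeric positive class.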

In [None]:
from sklearn import svm

In [None]:
clf = svm.SVC()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.2, random_state=2)
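
Because the vectorizer in this notebook is fitted before the split, the test tweets influence the features. One hedged way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the TF-IDF statistics are learned from the training fold only. The toy texts and labels below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data standing in for df.text and df.airline_sentiment.
texts = [
    "flight delayed and bags lost",
    "rude service and late flight",
    "cancelled flight no refund",
    "worst delay ever",
    "great crew and smooth flight",
    "thank you for the upgrade",
    "love the friendly service",
    "best flight ever thank you",
]
labels = ["negative"] * 4 + ["positive"] * 4

# Split the raw text, not the vectorized matrix.
text_train, text_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# The pipeline fits TF-IDF on the training texts only, then trains the SVM.
model = make_pipeline(TfidfVectorizer(max_df=0.95), SVC())
model.fit(text_train, lab_train)

preds = model.predict(text_test)
print(list(preds))
```

The same pipeline object can be passed to `GridSearchCV`, in which case the vectorizer is refitted inside every cross-validation fold.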

In [None]:
clf.fit(X_train,y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [None]:
y_pred=clf.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
confusion_matrix(y_test,y_pred)

array([[1850,   23],
       [ 176,  260]])
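
The confusion matrix above has true labels in rows and predicted labels in columns, ordered (negative, positive). As a worked example, the sketch below derives precision, recall, F1 for the positive class, and overall accuracy directly from those four counts. The rounded values agree with the classification report printed later in the notebook.

```python
# Counts copied from the confusion matrix above.
tn, fp = 1850, 23    # true negatives, false positives
fn, tp = 176, 260    # false negatives, true positives

precision_pos = tp / (tp + fp)                      # of predicted positives, how many are right
recall_pos = tp / (tp + fn)                         # of actual positives, how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2), round(accuracy, 2))
```

The low recall on the positive class (176 of 436 positive tweets are missed) is hidden by the high accuracy, which is dominated by the much larger negative class.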

In [None]:
clf.get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

    negative       0.91      0.99      0.95      1873
    positive       0.92      0.60      0.72       436

    accuracy                           0.91      2309
   macro avg       0.92      0.79      0.84      2309
weighted avg       0.91      0.91      0.91      2309
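
The objective also asks for the ROC and precision-recall curves and the areas under them, which the cells above do not compute. The sketch below shows one way to obtain them from an SVM's `decision_function` scores. It uses synthetic data from `make_classification` as a stand-in for the tweet features, so the variable names and the resulting numbers are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced binary problem (roughly 80% class 0, 20% class 1).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

svc = SVC().fit(X_tr, y_tr)

# Signed distance to the separating surface; larger means "more positive".
scores = svc.decision_function(X_te)

# Points on the two curves (these arrays can be passed to a plotting library).
fpr, tpr, _ = roc_curve(y_te, scores)
precision, recall, _ = precision_recall_curve(y_te, scores)

# Areas under the curves.
roc_auc = roc_auc_score(y_te, scores)
pr_auc = average_precision_score(y_te, scores)

print(round(roc_auc, 3), round(pr_auc, 3))
```

With string labels such as those used in this notebook, `roc_curve` and `precision_recall_curve` need `pos_label="positive"` to know which class the scores refer to.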



In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Only C is searched here; gamma stays at its default value 'scale'
param_grid = {'C': [0.1, 1, 10, 100, 1000]}

In [None]:
grid = GridSearchCV(clf, param_grid, refit = True, verbose = 3,n_jobs=-1,cv=5)

In [None]:
grid.fit(X_train,y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:  1.6min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': [0.1, 1, 10, 100, 1000]}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=3)
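
The search above varies only $C$, while the objective asks for $C$ and $\gamma$ together. A minimal sketch of a joint search follows, run on synthetic data so that it is self-contained. The grid values and the `f1_macro` scoring choice are assumptions for illustration; macro F1 is used here because accuracy can look good on imbalanced classes even when the minority class is poorly recalled.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized tweets.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid: 3 values of C times 3 values of gamma = 9 candidates.
param_grid_demo = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid_demo, scoring="f1_macro", cv=5)
search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated score.
print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_params_` and `search.cv_results_` show which combination won and how the others compared.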

In [None]:
y_pred=grid.predict(X_test)

In [None]:
sample=['the food not was bad but good']

In [None]:
sample1 = vectorizer.transform(sample)

In [None]:
grid.predict(sample1)

array(['negative'], dtype=object)

In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

    negative       0.92      0.98      0.95      1873
    positive       0.89      0.64      0.74       436

    accuracy                           0.92      2309
   macro avg       0.91      0.81      0.85      2309
weighted avg       0.92      0.92      0.91      2309



In [None]:
msg_train,msg_test,label_train,label_test = train_test_split(df.text,df.airline_sentiment,test_size=0.2)

In [None]:
train_vectorized = vectorizer.transform(msg_train)
test_vectorized = vectorizer.transform(msg_test)

In [None]:
# GaussianNB does not accept sparse matrices, so convert the features to dense arrays
train_array = train_vectorized.toarray()
test_array = test_vectorized.toarray()

In [None]:
from sklearn.naive_bayes import GaussianNB
tweets_model = GaussianNB().fit(train_array,label_train)

In [None]:
train_preds = tweets_model.predict(train_array)
test_preds = tweets_model.predict(test_array)

In [None]:
# Confusion matrix for the test set
print(confusion_matrix(label_test,test_preds))

[[1345  509]
 [ 158  297]]


In [None]:
# Print the classification report for the test set
print(classification_report(label_test,test_preds))

              precision    recall  f1-score   support

    negative       0.89      0.73      0.80      1854
    positive       0.37      0.65      0.47       455

    accuracy                           0.71      2309
   macro avg       0.63      0.69      0.64      2309
weighted avg       0.79      0.71      0.74      2309
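
`GaussianNB` requires dense arrays and models each feature as a Gaussian, which is a poor fit for sparse, non-negative text features. As a hedged alternative (a different model, not what the notebook ran), `MultinomialNB` accepts the sparse matrix directly. The toy texts below are assumptions for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for the tweets.
texts = [
    "flight delayed and bags lost",
    "rude service and late flight",
    "cancelled flight no refund",
    "worst delay ever",
    "great crew and smooth flight",
    "thank you for the upgrade",
    "love the friendly service",
    "best flight ever thank you",
]
labels = ["negative"] * 4 + ["positive"] * 4

vec = TfidfVectorizer()
X_sparse = vec.fit_transform(texts)    # stays sparse, no .toarray() needed

nb = MultinomialNB().fit(X_sparse, labels)
pred = nb.predict(vec.transform(["thank you great crew"]))
print(pred[0])
```

Avoiding `.toarray()` also saves memory, since a dense matrix stores every zero explicitly.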

