##  Introduction
   This project is to classify the twitter data provided for each user and to predict the Gender of the user ID provided in the test dataset.
   To perform this task following steps are followed:
   
   * After loading Test and Train dataset into program the Userid is extracted from both into a list.
   * Twitter data for each userid is fecthed from the xml file provided for each user.
   * Text processing and the feature extraction process is done for tweets once after we extract data for all user from the unstructured data file and convert the extracted data into suitable format.
   * After preprocessing, we split user data into train and test based on the user ID provided in Train and Test dataset respectively.
   * Using the preprocessed data and extracted feature the classifier model is built to predict the gender of the user  data is fed into classifier to predict


### Importing Necessary packages

In [3]:
# !pip install emoji
import pandas as pd
import xml.dom.minidom as x
import nltk
from nltk.corpus import reuters
import re
from nltk.corpus import wordnet
import os
# from emoji import UNICODE_EMOJI
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
import itertools
import collections
from nltk import FreqDist

from sklearn import svm, tree
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

In [4]:
def read_xml(filename):  #function to load tweets for each user from xml file.
    xml_data = x.parse(filename)  # parsing the xml file of the user
    document_data = xml_data.getElementsByTagName('document') #extracting only document tag to get tweets
    user_data = []
#     print(document_data)
    for data in document_data: #looping through all the document tag available(100 documents for each user)
        temp_data = data.firstChild.data.lower() #getting data from the taqg
#         print(temp_data)
        user_data.append(temp_data) #appending tag into list
    return user_data

In [5]:
import nltk
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
def get_tokenize_tweet(tweet):    #function to tokenise the tweet by keeping only the words with alphabets
    token_tweet = []
    token_no_stopwords = []
#     tweet = tweet.lower()
    tweet = re.sub(r'@(\w+)+',' ',tweet)  # removing mentions from tweet
    tweet = re.sub(r"#(\w+)",' ',tweet) #removing mentions from tweet
    tweet = re.sub('[\'_]',' ',tweet)
    tweet = re.sub(r"http[s]?://(?:\w+|\d+|[$-_@.&+]|[!*\(\),]|(?:%[\d\w][\d\w]))+",' ',tweet) # removing url from tweet
    tokenizer = RegexpTokenizer("[A-Za-z]\w+(?:[-'?]\w+)?")
    temp_tweet = list(tokenizer.tokenize(tweet))
    for word in temp_tweet:
        word = word.lower()
        if len(word) > 1:
            token_tweet.append(word)
            if word not in stopwords_list:
                token_no_stopwords.append(word)                   
    return token_tweet,token_no_stopwords

In [6]:
import nltk
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
def remove_stopwords(tweet):
    stop_removed_tweet = []
    for word in tweet:
        if word not in stopwords_list:
            stop_removed_tweet.append(word)
    return stop_removed_tweet


In [16]:
os. getcwd()

'/Users/neeraj/Neeraj/Data Science Projects/Twitter Sentiment Analysis/Assessment2_data'

#### Loading file 

In [18]:
path, dirs, files = next(os.walk("/Users/neeraj/Neeraj/Data Science Projects/Twitter Sentiment Analysis/Assessment2_data/data/")) #extracting all the filename from data folder
filesname_list = []
for file in files:
    if file.endswith('.xml'):
        filesname_list.append(file)
file_count = len(set(filesname_list))
print('Number of user documents present in data folder:',file_count)

Number of user documents present in data folder: 3600


In [20]:
train_data = pd.read_csv("/Users/neeraj/Neeraj/Data Science Projects/Twitter Sentiment Analysis/Assessment2_data/train_labels.csv") #reading Train and test data
test_data = pd.read_csv("/Users/neeraj/Neeraj/Data Science Projects/Twitter Sentiment Analysis/Assessment2_data/test.csv")
train_user_id_list = list(train_data['id'])
test_user_id_list = list(test_data['id'])
user_id_list = train_user_id_list  + test_user_id_list
print('Number of users combining Train and Test data:',(len(set(user_id_list))))

Number of users combining Train and Test data: 3600


In [21]:
user_lam = lambda file: re.sub('.xml','',file) # getting list of users by removing .xml from the filename list
doc_userid_list = list(map(user_lam, files))

#### Reading xml file of all the users

In [22]:
%%time
user_data_list = []  
user_id_list = []
for file in filesname_list:
    user_id = re.sub('.xml','',file) #extracting user ID from filenames
    file_path = path + file  #creating path of file 
    temp_user_data = read_xml(file_path)   #reading xml file data
    user_data_list.append(temp_user_data)
    user_id_list.append(user_id)
tweet_data_dict = dict(zip(user_id_list,user_data_list))  #creating dictionary using UserID as key and user tweets as value

CPU times: user 4.44 s, sys: 1.12 s, total: 5.56 s
Wall time: 11.4 s


In [23]:
len(tweet_data_dict)

3600

#### Preprocessing Tweets 

Steps in preprocessing:
* Extracting all 100 tweets for each user.
* for each tweets remove hashtags.
* Remove URLS.
* Remove mentions.
* Remove spaces.
* Extract words only with alpahbets by removing numbers and special characters.
* Create list with all the words for each tweet [each tweet appened as new list].
* Using above list, create bag of tweets(to create corpus) for each user [Combining all 100 tweets into list].
* Create list with tweet word which are not stopwords.
* Using stopword removed list, create bag of stopwords removed tweet words for each user.

By using the four list createdd in above step and using userID, create a dataframe

In [24]:
%%time

tweet_list = [] # tweet list for each user, which save list of tweets
tweet_no_stopwords_list = [] # list in stopword removed tweets
tweet_bag = [] #list is combine words from all 100 tweets for each user
tweet_no_stopwords_bag = []
user_id_list = []

for user,data in tweet_data_dict.items(): #looping through each key and value in dictionary
    temp_tweet = []  
    temp_no_stopwords = []
    tweet_corpus = []
    tweet_no_stopwords_corpus = []
    for tweet in data:   #looping through each tweet for a user
        tweet_token,token_no_stopwords = get_tokenize_tweet(tweet) #extracting preprocessed tweet and stopword removed tweets
        temp_tweet.append(tweet_token) #list of 100 preprocessed tweets separetly.
        tweet_corpus += tweet_token #combining all 100 preprocessed tweets into 1 list.
        
    
    #appending preprocessed data of a user into list
    user_id_list.append(user)
    tweet_list.append(temp_tweet)
    tweet_bag.append(tweet_corpus)
#creating dataframe using above list and user id   
tweet_data_df = pd.DataFrame(list(zip(user_id_list,tweet_list,tweet_bag)),
                             columns=['user_id','tweets','tweet_bag'])


CPU times: user 15 s, sys: 883 ms, total: 15.8 s
Wall time: 21.1 s


In [25]:
tweet_data_df.head()

Unnamed: 0,user_id,tweets,tweet_bag
0,b91efc94c91ad3f882a612ae2682af17,"[[news, flash, popcorn-flavored, tic-tacs, tas...","[news, flash, popcorn-flavored, tic-tacs, tast..."
1,8ebb5b1633c16c5636f24bbfb70d26bb,"[[another, superb, episode, while, can, wait, ...","[another, superb, episode, while, can, wait, t..."
2,ff91e6d4b79fc64072ae273aa3fed77e,"[[no, rod, city, has, continues, to, advocate,...","[no, rod, city, has, continues, to, advocate, ..."
3,7e199c5885131a2579429c07f3215cbc,"[[separation, of, church, and, state], [all, t...","[separation, of, church, and, state, all, the,..."
4,cdc2d20d75f8187ee54caf56b2c77626,"[[first, impressions, with, tina, fey], [warni...","[first, impressions, with, tina, fey, warning,..."


##### Feature Extraction
Step 1: creating bigrams for all the words in tweet data 

In [26]:
#Taking only top 200 bigram which is having frequncy morethan 20.
# Bigrams for words which are not stopwords
# Creating bigram with word lenght less than 3
stp_word_list = stopwords.words()
clean_tweets = tweet_data_df['tweet_bag']
all_tweets = list(itertools.chain(*clean_tweets))
bigram_list = []
bigram_measure = nltk.collocations.BigramAssocMeasures() #creating bigrams
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_tweets)
bigram_finder.apply_freq_filter(20) #filtering with frequency of 20
bigram_finder.apply_word_filter(lambda w: len(w) < 3)# or w.lower() in ignored_words)
bigram_finder.apply_ngram_filter(lambda w1, w2: w1 == w2)  #filetring bigram if both words are same
bigram_finder.apply_ngram_filter(lambda w1, w2: w1  in stp_word_list or w2  in stp_word_list) #removing stopwords from bigram
top_bigram_list = bigram_finder.nbest(bigram_measure.pmi, 200) # Selecting top 200 bigrams
for value in top_bigram_list:
    bigram_list.append(value[0]+"_"+value[1])  #extracting list of bigrams

In [27]:
bigram_tokenizer = MWETokenizer(mwes=top_bigram_list,separator= "_") #Adding bigram words to tokenizer with separator "_"
user_tweets = tweet_data_df['tweet_bag']
bigram_tweets_list = []
for tweet in user_tweets: #looping through bag of user data
    bigram_tweets_list.append(bigram_tokenizer.tokenize(tweet)) #creating bigrams using tokenizer    
print('Lenght of user_tweets after adding bigrams:',len(bigram_tweets_list))

Lenght of user_tweets after adding bigrams: 3600


Step 2: creating trigrams for all the words in tweet data 

In [28]:
#Taking only top 200 trigrams which is having frequncy morethan 20.
# Trigrams for words which are not stopwords
# Creating trigram with word lenght less than 3
# Not creating the Trigram,If any two words in the trigrams are same.
trigram_list = []
all_tweets = list(itertools.chain(*bigram_tweets_list))
trigram_measure = nltk.collocations.TrigramAssocMeasures() #creating trigrams
trigram_finder = nltk.collocations.TrigramCollocationFinder.from_words(all_tweets)
trigram_finder.apply_freq_filter(20) #filtering with frequency of 20
trigram_finder.apply_word_filter(lambda words: len(words) < 3)# or w.lower() in ignored_words)
trigram_finder.apply_ngram_filter(lambda word1,word2,word3: word1 == word2 or word2==word3 or word1==word3)  #filetring bigram if both words are same
trigram_finder.apply_ngram_filter(lambda word1,word2,word3: word1 in stp_word_list or word2 in stp_word_list or word3 in stp_word_list) #removing stopwords from trigram
top_trigram_list = trigram_finder.nbest(trigram_measure.pmi, 200) # Selecting top 200 bigrams
for value in top_trigram_list: 
    trigram_list.append(value[0]+"_"+value[1]+"_"+value[2]) #Creating Trigrams list with'_' as seperator
     

In [29]:
trigram_tokenizer = MWETokenizer(mwes=top_trigram_list,separator= "_") #Adding Trigram words to tokenizer with separator "_"
trigram_tweets_list = []
for tweet in bigram_tweets_list: #looping through tweets for which bigrams are created
    trigram_tweets_list.append(trigram_tokenizer.tokenize(tweet))#creating trigrams using tokenizer 
print('Lenght of user_tweets after adding bigrams:',len(trigram_tweets_list))  


Lenght of user_tweets after adding bigrams: 3600


If we remove stopwords before creating Ngrams, it will leads to __wrong set of Ngrams__.
So removing it after __extracting Ngrams__ from the tweets

In [30]:
sw_removed_tweets = []
for user_tweets in trigram_tweets_list:
    temp_tweet = remove_stopwords(user_tweets)
    sw_removed_tweets.append(temp_tweet)
len(sw_removed_tweets)

3600

In [31]:
tweet_data_df['sw_removed_tweets'] = sw_removed_tweets #creating new column with Ngrams extracted and stopwords removed tweets for each user
tweet_data_df.head()

Unnamed: 0,user_id,tweets,tweet_bag,sw_removed_tweets
0,b91efc94c91ad3f882a612ae2682af17,"[[news, flash, popcorn-flavored, tic-tacs, tas...","[news, flash, popcorn-flavored, tic-tacs, tast...","[news, flash, popcorn-flavored, tic-tacs, tast..."
1,8ebb5b1633c16c5636f24bbfb70d26bb,"[[another, superb, episode, while, can, wait, ...","[another, superb, episode, while, can, wait, t...","[another, superb, episode, wait, see, plays, w..."
2,ff91e6d4b79fc64072ae273aa3fed77e,"[[no, rod, city, has, continues, to, advocate,...","[no, rod, city, has, continues, to, advocate, ...","[rod, city, continues, advocate, bc, assessmen..."
3,7e199c5885131a2579429c07f3215cbc,"[[separation, of, church, and, state], [all, t...","[separation, of, church, and, state, all, the,...","[separation, church, state, administration, li..."
4,cdc2d20d75f8187ee54caf56b2c77626,"[[first, impressions, with, tina, fey], [warni...","[first, impressions, with, tina, fey, warning,...","[first, impressions, tina, fey, warning, foota..."


Step 3: Using TFIDF of the tweets, the less frequent and more frequent word for removed from the tweets

In [32]:
all_words = list(itertools.chain(*sw_removed_tweets))
print('Total Number of words:',len(all_words))

Total Number of words: 2379380


In [33]:
#extracting unique words for each user tweets
user_unique_word = [set(tweet) for tweet in tweet_data_df['sw_removed_tweets'] ]
print('Number of users',len(user_unique_word))

Number of users 3600


In [None]:
all_tweet_unique_words = list(itertools.chain(*user_unique_word))
# len(all_tweet_unique_words)

In [None]:
# calculating frequency distribution(TFIDF for each word)
word_freq_list = FreqDist(all_tweet_unique_words) 
word_freq_dict = dict(word_freq_list)

In [None]:
word_freq_list = list(dict(word_freq_list).items()) 
word_freq_df = pd.DataFrame(word_freq_list,columns=['words','doc_count']) #creating DF using frequency distribution dictionary

In [None]:
word_freq_df = word_freq_df.sort_values(by='doc_count') #sorting data to count of words

In [None]:
print('Total number of unique words:',len(word_freq_list))

In [None]:
word_freq_df[:10] #Top 10 least frequent words

In [None]:
word_freq_df[-10:] #Top 10 most frequent words

##### Removing words appear in 95%[Number of user:3420]  and 5%[Number of user: 180] of usertweets

In [None]:
total_user = len(tweet_data_df)
len(word_freq_df[(word_freq_df["doc_count"]> int(total_user*0.95))] )


there are no words appearing in 3420 (95%) of users tweets, so making the maximum threshold as 85%

In [None]:
word_freq_df[(word_freq_df["doc_count"]> int(total_user*0.85))] #words available in 85% for users tweet

In [None]:
len(word_freq_df[(word_freq_df["doc_count"]< int(total_user*0.01))]) #count of words avaiable only in 5% users tweet

Well, this is helpful. more words appears only in  180 (5%) users tweets, Removing of this words will help our model to perform well with less time

In [None]:
#Extracting frequent words from the dataframe based on th threshold dataframe
threshold_word_list = list(word_freq_df[(word_freq_df["doc_count"]> int(total_user*0.85))]['words'])
threshold_word_list += list(word_freq_df[(word_freq_df["doc_count"] < int(total_user*0.05))]['words'])

In [None]:
print('Number of words comes in the provided document threshold:',len(threshold_word_list))

#### Removing the words which are in the given document threshold for all the users tweet

* As the threshold words contains more word, it will take more time to comapre words in for loop using the list.
* So to save the time, we are taking the intersection of the tweet words set and the set of threshold words

In [None]:
%%time
freq_word_rm_list = []
sw_removed_tweets_list = list(tweet_data_df['sw_removed_tweets']) # cleaned words list to remove the threshold words
freq_rm_tweets_list = []
for user_tweet in sw_removed_tweets_list: #looping through all the  tweets
    #extracting only words from tweets, which are not in threshold words
    # converting both lists into set for easy comparision using set intersection
    words = list(set(user_tweet) - set(threshold_word_list)) 
    user_tweet = [word for word in user_tweet if word in words]
    freq_rm_tweets_list.append(user_tweet)


In [None]:
len(freq_rm_tweets_list)

In [None]:
tweet_data_df['freq_rm_tweets'] = freq_rm_tweets_list
tweet_data_df.head()

### Model selection 

To select the classifier with best accuracy and performance, we are using 6 classifiers on the data.

Classifiers are:
* LogisticRegression
* SVC
* LinerSVC
* DecisionTree classifer
* RandomForest classifier
* MLPClassifier

All the model for build on the data extracted from the four different method of vectorisors:

Feature vector extract method:
1. TFIDF vectoriser with binary parameter as False
2. TFIDF vectoriser with binary parameter as True
3. Count vectoriser with binary parameter as True
4. Count vectoriser with binary parameter as False

All 6 classifiers are build on each feature extraction method. So we build 4(method)*6(classifier) = __24 models__ to select the best classifier and the vectorising method in the following setps

In [None]:
#Splitting data into train and test dataset based in the User ID provided in train and test lable file
test_labels = pd.read_csv('Assessment2_data/test_labels.csv')
train_df= pd.merge(train_data, tweet_data_df, left_on='id',right_on='user_id')
test_df = pd.merge(test_labels, tweet_data_df, left_on='id',right_on='user_id')

In [None]:
# Function to fit and get accuracy and F1_score for the model 
def get_accuracy_model(model,train_x, test_x, train_y, test_y):
    model.fit(train_x,train_y) #fit model using train_x and train_y dataset
    predict_y = model.predict(test_x) #opredicting y value 
    accuracy = accuracy_score(test_y,predict_y) # calculating accuracy
    f1_score1 = f1_score(test_y,predict_y,average='macro')
    return (accuracy,f1_score1)


In [None]:
# funtion to create all 6 models which are using in this task and to get accuracy for each of them
def create_all_model(train_x,train_y,test_x,test_y,node_classification):
        
    node_classification_list = []
    model_name_list = ['LogisticRegression','SVC','LinearSVC','DecisionTree','RandomForest','MLPClassifier']
    accuracy_score_list = []
    f1_score_list = []
    accu_result = []
    #splitting data into train and test data
    lr_model = LogisticRegression() #creating LogisticRegression model
    # getting accuracy and F1_score for LogisticRegression
    accu_result.append(get_accuracy_model(lr_model,train_x, test_x, train_y, test_y))
    
    svc_model = SVC() #creating SVC model
    #getting accuracy and F1_score for SVC
    accu_result.append(get_accuracy_model(svc_model,train_x, test_x, train_y, test_y))
    
    lsvc_model = LinearSVC() #creating LinearSVC model
    #getting accuracy and F1_score for LinearSVC
    accu_result.append(get_accuracy_model(lsvc_model,train_x, test_x, train_y, test_y))
  
    dt_model = DecisionTreeClassifier(criterion='entropy') #creating Decision Tree model
    #getting accuracy and F1_score for Decision Tree model    
    accu_result.append(get_accuracy_model(dt_model,train_x, test_x, train_y, test_y))

    rf_model = RandomForestClassifier()#creating RandomForest classifier
    #getting accuracy and F1_score for RandomForest classifier
    accu_result.append(get_accuracy_model(rf_model,train_x, test_x, train_y, test_y))
    
    ml_model = MLPClassifier(hidden_layer_sizes=(400,),alpha=0.9)
    accu_result.append(get_accuracy_model(ml_model,train_x, test_x, train_y, test_y))
    
    
    for accuracy in accu_result: #looping through accuracy of each classifier
        node_classification_list.append(node_classification) #appending node_classification method name
        accuracy_score_list.append(accuracy[0]) #accuracy into list
        f1_score_list.append(accuracy[1]) #F1_score into list
    #result dataframe for the classifier build using the particular node classifier
    accu_df = pd.DataFrame(list(zip(node_classification_list,model_name_list,accuracy_score_list,f1_score_list)),
                         columns = ['Feature_extraction type','Classifier','Accuracy','F1_score'])
    return accu_df




In [None]:
%%time
# converting the bag of words for each users tweet into str, to generate feature vectoe
train_tweets = []
test_tweets = []
for tweet in train_df['freq_rm_tweets']:
    strn = ' '.join(word for word in tweet)
    train_tweets.append(strn)
    
for tweet in test_df['freq_rm_tweets']:
    strn = ' '.join(word for word in tweet)
    test_tweets.append(strn)
    

# extracting y labels from the train and test data   
y_train = train_df['gender']
y_test = test_df['gender']




#### Method 1: Tdidf with binary as False

In [42]:
# %%time
# tfidfconverter = TfidfVectorizer(binary=False)
# x_train = tfidfconverter.fit_transform(train_tweets).toarray()
# x_test = tfidfconverter.fit_transform(test_tweets).toarray()    
# tdidf_false_df = create_all_model(x_train,y_train,x_test,y_test,'TFIDF with Binary: False')

CPU times: user 54.1 s, sys: 2.82 s, total: 56.9 s
Wall time: 41.2 s


In [43]:
# tdidf_false_df

Unnamed: 0,Feature_extraction type,Classifier,Accuracy,F1_score
0,TFIDF with Binary: False,LogisticRegression,0.816,0.815953
1,TFIDF with Binary: False,SVC,0.496,0.331551
2,TFIDF with Binary: False,LinearSVC,0.776,0.775943
3,TFIDF with Binary: False,DecisionTree,0.612,0.611901
4,TFIDF with Binary: False,RandomForest,0.654,0.651175
5,TFIDF with Binary: False,MLPClassifier,0.806,0.805589


In this method Logistic Regression is performing better with 0.816 accuracy

#### Method 2: Tdidf with binary as True

In [44]:
# %%time
# tfidfconverter = TfidfVectorizer(binary=True)
# x_train = tfidfconverter.fit_transform(train_tweets).toarray()
# x_test = tfidfconverter.fit_transform(test_tweets).toarray()
# tdidf_true_df =create_all_model(x_train,y_train,x_test,y_test,'TFIDF with Binary: True')

CPU times: user 51.2 s, sys: 2.64 s, total: 53.8 s
Wall time: 40.5 s


In [45]:
# tdidf_true_df

Unnamed: 0,Feature_extraction type,Classifier,Accuracy,F1_score
0,TFIDF with Binary: True,LogisticRegression,0.806,0.805825
1,TFIDF with Binary: True,SVC,0.496,0.331551
2,TFIDF with Binary: True,LinearSVC,0.77,0.769925
3,TFIDF with Binary: True,DecisionTree,0.624,0.622642
4,TFIDF with Binary: True,RandomForest,0.638,0.633886
5,TFIDF with Binary: True,MLPClassifier,0.786,0.785278


In this method performance Logistic Regression is slightly decresed and best performing model is MLP classifier with  better with accuracy of 0.81

#### Method 3: count vectorizer with binary as True

In [46]:
# %%time
# count_vect = CountVectorizer(binary=True)
# x_train = count_vect.fit_transform(train_tweets).toarray()
# x_test = count_vect.fit_transform(test_tweets).toarray()
# cnt_vec_true_df = create_all_model(x_train,y_train,x_test,y_test,'Count_vec with Binary: True')

CPU times: user 49.4 s, sys: 2.88 s, total: 52.3 s
Wall time: 36.4 s


In [47]:
# cnt_vec_true_df

Unnamed: 0,Feature_extraction type,Classifier,Accuracy,F1_score
0,Count_vec with Binary: True,LogisticRegression,0.72,0.719996
1,Count_vec with Binary: True,SVC,0.792,0.79188
2,Count_vec with Binary: True,LinearSVC,0.682,0.681938
3,Count_vec with Binary: True,DecisionTree,0.658,0.657988
4,Count_vec with Binary: True,RandomForest,0.656,0.653784
5,Count_vec with Binary: True,MLPClassifier,0.752,0.751996


Unbeliveable, the performance of SVC increased significantly in this method and it is best model with 0.79 accuracy

#### Method 4 : count vectorizer with binary as False

In [48]:
# %%time 
# count_vect = CountVectorizer(binary=False)
# x_train = count_vect.fit_transform(train_tweets).toarray()
# x_test = count_vect.fit_transform(test_tweets).toarray()
# cnt_vec_false_df = create_all_model(x_train,y_train,x_test,y_test,'Count_vec with Binary: False')

CPU times: user 48.7 s, sys: 2.91 s, total: 51.6 s
Wall time: 36.7 s


In [49]:
# cnt_vec_false_df

Unnamed: 0,Feature_extraction type,Classifier,Accuracy,F1_score
0,Count_vec with Binary: False,LogisticRegression,0.746,0.745999
1,Count_vec with Binary: False,SVC,0.808,0.807923
2,Count_vec with Binary: False,LinearSVC,0.736,0.736
3,Count_vec with Binary: False,DecisionTree,0.64,0.639024
4,Count_vec with Binary: False,RandomForest,0.698,0.695956
5,Count_vec with Binary: False,MLPClassifier,0.784,0.783945


Again the SVC model beats all other model with 0.80 accuracy, in this method.

we can observe that the SVC classifier is performing better with countvectoriser

#### Selecting the best model and feature vector method based on the accuracy and F1_score

In [50]:
# res = [tdidf_true_df,tdidf_false_df,cnt_vec_false_df,cnt_vec_true_df] #merging result of all methods into one 
# final_res_df = pd.concat(res,ignore_index=True)

In [51]:
# print('Number of models build for model and feature vector method selection:',len(final_res_df))

Number of models build for model and feature vector method selection: 24


In [52]:
# final_res_df.sort_values(['Accuracy','F1_score'],ascending=False)

Unnamed: 0,Feature_extraction type,Classifier,Accuracy,F1_score
6,TFIDF with Binary: False,LogisticRegression,0.816,0.815953
13,Count_vec with Binary: False,SVC,0.808,0.807923
0,TFIDF with Binary: True,LogisticRegression,0.806,0.805825
11,TFIDF with Binary: False,MLPClassifier,0.806,0.805589
19,Count_vec with Binary: True,SVC,0.792,0.79188
5,TFIDF with Binary: True,MLPClassifier,0.786,0.785278
17,Count_vec with Binary: False,MLPClassifier,0.784,0.783945
8,TFIDF with Binary: False,LinearSVC,0.776,0.775943
2,TFIDF with Binary: True,LinearSVC,0.77,0.769925
23,Count_vec with Binary: True,MLPClassifier,0.752,0.751996


* From the above data, we can clearly say that the __Logistic Regression__ is giving better result with __TFIDF with binary Flase__ Vectorriser .
* Accuracy of this model is 0.816

#### Tuning  model 
* Logistic Regression model to select the best hyperparamter for the model, to get best result.
* This will be done using __Grid search cross validation method__ (GridseachCV)


In [67]:
%%time
#creting feature vector using selected best vectorising method(TDIDF with binary False)
tfidfconverter = TfidfVectorizer(binary=False)
x_train = tfidfconverter.fit_transform(train_tweets).toarray()
x_test = tfidfconverter.fit_transform(test_tweets).toarray()


pipeline = Pipeline([('classifier' , LogisticRegression())])


# Creating param grid with multiple values for the parameters
param_grid = [
    {
     'classifier__penalty' : ['l1', 'l2'],
     'classifier__C' : list(range(2, 20, 2)),
     'classifier__solver' : ['liblinear']},]

# Create grid search object
clf = GridSearchCV(pipeline, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)

# Fit on data on GridSearchCV model
best_clf = clf.fit(x_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.3s


CPU times: user 1.27 s, sys: 98.3 ms, total: 1.37 s
Wall time: 10.8 s


[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:    9.8s finished


In [54]:
clf.best_estimator_.get_params() #displaying the best parameters to get better result/accuracy

{'memory': None,
 'steps': [('classifier',
   LogisticRegression(C=2, class_weight=None, dual=False, fit_intercept=True,
             intercept_scaling=1, max_iter=100, multi_class='warn',
             n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
             tol=0.0001, verbose=0, warm_start=False))],
 'classifier': LogisticRegression(C=2, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
           tol=0.0001, verbose=0, warm_start=False),
 'classifier__C': 2,
 'classifier__class_weight': None,
 'classifier__dual': False,
 'classifier__fit_intercept': True,
 'classifier__intercept_scaling': 1,
 'classifier__max_iter': 100,
 'classifier__multi_class': 'warn',
 'classifier__n_jobs': None,
 'classifier__penalty': 'l2',
 'classifier__random_state': None,
 'classifier__solver': 'liblinear',
 'classifier__tol': 0.0001,
 'classi

It tells best parameters that Logistic model are:
 * C is 2
 * penalty is 'l2' and so on.
 Now creating model with the above parameter value

In [55]:
model = LogisticRegression(C=2, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
           tol=0.0001, verbose=0, warm_start=False)

In [56]:
# m.fit(x_train,y_train)
model = model.fit(x_train,y_train)
prediction = model.predict(x_test)
accuracy,f1_score = get_accuracy_model(model,x_train,x_test,y_train,y_test)
print('Accuracy of the Tuned model is :',accuracy)
print('F1_score of the Tuned model is :',f1_score)

Accuracy of the Tuned model is : 0.822
F1_score of the Tuned model is : 0.8219992879971519


##### Predticing gender for the user in test data using selected model

In [57]:
model = model.fit(x_train,y_train)
prediction = model.predict(x_test)
test_df['prediction'] = prediction #attaching prediction into test dataframe

In [58]:
test_df.head()

Unnamed: 0,id,gender,user_id,tweets,tweet_bag,sw_removed_tweets,freq_rm_tweets,prediction
0,d6b08022cdf758ead05e1c266649c393,male,d6b08022cdf758ead05e1c266649c393,"[[what, odds, he, stops, whining, and, goes, o...","[what, odds, he, stops, whining, and, goes, ou...","[odds, stops, whining, goes, gets, proper, job...","[goes, gets, proper, job, rest, us, would, ima...",male
1,9a989cb04766d5a89a65e8912d448328,female,9a989cb04766d5a89a65e8912d448328,"[[bingay, won, cool, handy, tonight], [], [we,...","[bingay, won, cool, handy, tonight, we, made, ...","[bingay, cool, handy, tonight, made, summercit...","[cool, tonight, made, work, lunch, bay, beauti...",female
2,2a1053a059d58fbafd3e782a8f7972c0,male,2a1053a059d58fbafd3e782a8f7972c0,"[[the, cynical, manipulation, of, voters, desi...","[the, cynical, manipulation, of, voters, desir...","[cynical, manipulation, voters, desire, honest...","[voters, honest, government, politics, climate...",male
3,6032537900368aca3d1546bd71ecabd1,male,6032537900368aca3d1546bd71ecabd1,"[[cannot, convert, to, object, on, sony, braav...","[cannot, convert, to, object, on, sony, braavi...","[cannot, convert, object, sony, braavia, happe...","[cannot, happening, week, idea, behind, every,...",male
4,d191280655be8108ec9928398ff5b563,male,d191280655be8108ec9928398ff5b563,"[[cat, is, kneading, maniac, floppycats], [the...","[cat, is, kneading, maniac, floppycats, the, l...","[cat, kneading, maniac, floppycats, left, goes...","[cat, left, goes, trump, giving, free, ride, p...",male


#### Extracting data to creating predict_label file

In [59]:
final_df = test_df[['id','prediction']] # extracting id and predicted to save data 
kaggle_df = test_df[['id','prediction']] # extracting id and predicted gender for kaggle submission

In [60]:
final_df.columns = ['id','gender']
kaggle_df.columns = ['id','gender']
final_df.head()

Unnamed: 0,id,gender
0,d6b08022cdf758ead05e1c266649c393,male
1,9a989cb04766d5a89a65e8912d448328,female
2,2a1053a059d58fbafd3e782a8f7972c0,male
3,6032537900368aca3d1546bd71ecabd1,male
4,d191280655be8108ec9928398ff5b563,male


In [61]:
final_df.to_csv('predict_label.csv',index=False) #saving result in predict_labels.csv file

##### Extracting data to kaggle submission

In [62]:
# kaggle_df = kaggle_df[0:500]

In [63]:
# kaggle_df.head()

Unnamed: 0,id,gender
0,d6b08022cdf758ead05e1c266649c393,male
1,9a989cb04766d5a89a65e8912d448328,female
2,2a1053a059d58fbafd3e782a8f7972c0,male
3,6032537900368aca3d1546bd71ecabd1,male
4,d191280655be8108ec9928398ff5b563,male


In [64]:
# kaggle_df['gender'] = kaggle_df['gender'].replace('female',0 ) # as specified converting female as o and male as 1 
# kaggle_df['gender'] = kaggle_df['gender'].replace('male',1 )

In [65]:
# kaggle_df.head()

Unnamed: 0,id,gender
0,d6b08022cdf758ead05e1c266649c393,1
1,9a989cb04766d5a89a65e8912d448328,0
2,2a1053a059d58fbafd3e782a8f7972c0,1
3,6032537900368aca3d1546bd71ecabd1,1
4,d191280655be8108ec9928398ff5b563,1


In [66]:
# kaggle_df.to_csv('predict_label_kaggle.csv',index=False)