# Assignment 5. Machine Learning and Natural Language Processing

OPIM 5894 Data Science with Python

Name: Karpagam Thamaya Vinayagam  NetID: ktv16001

Discussed with: Varnika Yertha

## Instructions
In this assignment, you are asked to predict the gender of users from their public information on websites. In question 1, you predict gender using only the username. In question 2, you predict gender using the user's profile description instead. Finally, you may combine all available user information to make predictions. You may explore different models, different combinations of features, and different ways of transforming features to achieve the best performance.
<br> <br>
- It is recommended to use NLTK for this classification task, as the features stored in dictionary style can be easily extended. While scikit-learn is easier for Q2, it might not be that straightforward to combine different features in Q3. In addition, dealing with categorical variables can be a pain in scikit-learn. If you plan to use scikit-learn anyway, please read the following post: http://pbpython.com/categorical-encoding.html
- While prototyping, it is easier to stick to the Naive Bayes classifier. Add other classifiers once your code is bug-free.
- Use cross-validation on the training set to avoid over-fitting, though it is not guaranteed to achieve that purpose.


<br>
This assignment involves the following challenges:
- Construct features from strings (i.e., usernames)
- Frequent use of zip() and zip(*) (see doc https://docs.python.org/3/library/functions.html)
- Parsing a json style column into multiple columns
- Merging different features into one feature set
- Find appropriate models and features to improve prediction accuracy
- Writing and debugging a lot of code
<br><br>
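The `zip()` / `zip(*)` idiom from the list above can be sketched as follows. The usernames and labels are made-up placeholders, not rows from the assignment data.

```python
# Hypothetical sample data (not taken from train.csv).
usernames = ["alice99", "bob", "carol_x"]
genders = ["F", "M", "F"]

# zip() pairs items position by position: [(name, gender), ...]
pairs = list(zip(usernames, genders))

# zip(*pairs) undoes the pairing and yields one tuple per original column.
names_again, genders_again = zip(*pairs)

print(pairs[0])       # first (username, gender) pair
print(names_again)    # usernames recovered as a tuple
```

The same pattern is handy for turning a list of `(features, label)` pairs back into separate feature and label sequences.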

What to submit?
- The predictions of 5 models on the test set (see the sample submission sample_submission.csv). Diversify your portfolio, as similar models may suffer from similar problems.
- The notebook file (**please make sure that your code is sufficiently commented**)
- At the end of the notebook file, briefly describe what you have done, which models work best, and what you found.
<br><br>

The top 50% submissions will get 0-3 extra points. Try at least 3 models for each question. Try as many as you want for extra credit.
<br><br>
**Please do NOT distribute the dataset used in this assignment!**


In [777]:
import pandas as pd
import os
os.chdir('C:/Users/Karpagam/Documents/Python Scripts/PythonAssignment')

In [881]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


test.head()

Unnamed: 0,username,status,description
0,nazrulmadina,"{u'payment_verified': False, u'identity_verifi...",I am one of Self-employed person having more t...
1,SehrishWarraich,"{u'payment_verified': False, u'identity_verifi...",i am sehrish warraich.I do my job sincerelly a...
2,samadhinie,"{u'payment_verified': False, u'identity_verifi...","Since 2006, Web based solution provider (Web e..."
3,ebottabi,"{u'payment_verified': False, u'identity_verifi...",Founder of a geolocation service developed on...
4,mrjimoy,"{u'payment_verified': False, u'identity_verifi...","Me? I am Jimmy, meaning honest, courageous, or..."


## 1. Predicting Gender with Username
Some potential features of usernames: whether it has capital letters, whether it has digits, number of characters, number of vowels, first and last letters, etc. See http://www.nltk.org/book/ch06.html for some related code.

In [801]:
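# Extract gender-related features from a username
import re
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["count"] = len(name)  # number of characters
    features["ifalphanumeric"] = name.isalnum()
    # Note: str.isupper() is True only when every cased character is upper case,
    # so this flags all-caps usernames rather than "contains a capital letter".
    features["hasupper"] = name.isupper()
    vowels = ('a','e','i','o','u','A','E','I','O','U')
    # Only set when the name contains at least one letter or digit;
    # digits are stripped before testing the final character.
    if re.sub('[^A-Za-z0-9]+', '', name):
        features['endswithvowel'] = re.sub(r'[0-9]+', '', name).endswith(vowels)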
  

    return features
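
The task description also suggests features such as "has capital letters", "has digits", and "number of vowels". A minimal sketch of such a variant is shown below; `username_features` and its key names are hypothetical and are not the extractor used for the results in this notebook. It assumes a non-empty username.

```python
def username_features(name):
    """Illustrative username features; assumes `name` is a non-empty string."""
    letters = [ch for ch in name if ch.isalpha()]
    return {
        "first_letter": name[0].lower(),
        "last_letter": name[-1].lower(),
        "length": len(name),
        # True if at least one character is upper case.
        "has_upper": any(ch.isupper() for ch in name),
        "has_digit": any(ch.isdigit() for ch in name),
        "num_vowels": sum(ch.lower() in "aeiou" for ch in letters),
        # Looks at the last alphabetic character, ignoring digits and symbols.
        "ends_with_vowel": bool(letters) and letters[-1].lower() in "aeiou",
    }


example = username_features("Vimal20011")
print(example)
```

Because every key is always present, each feature dict has the same shape, which keeps later conversion to a DataFrame free of missing values.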

In [802]:
username_list=list(zip(train['username'],train['gender']))

In [803]:
#creating the featurelists 
featuresets = [(gender_features2(n), gender) for (n,gender) in username_list]



In [804]:
#Naive Bayes Classifier model
import nltk
from sklearn.model_selection import KFold
import numpy as np
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(featuresets):
    train_1 = [featuresets[i] for i in train_idx]
    test_1 = [featuresets[i] for i in test_idx]
    classifier_nb = nltk.NaiveBayesClassifier.train(train_1)   
    accu.append( nltk.classify.util.accuracy(classifier_nb, test_1) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu))    

accuracy: 0.8144
accuracy: 0.8056
accuracy: 0.8104
accuracy: 0.8248
accuracy: 0.7982385908726981
CV mean accuracy: 0.810687718175
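
The same cross-validation loop is repeated for every model in this notebook, each time with a fresh random shuffle. A possible way to factor it out is sketched below. `cross_validate`, `train_majority`, and `majority_accuracy` are hypothetical helpers written in plain Python, and the toy data is made up; a fixed `seed` is assumed so that different models are scored on the same folds.

```python
import random
from collections import Counter
from statistics import mean


def cross_validate(data, train_fn, accuracy_fn, n_splits=5, seed=0):
    """Return one held-out accuracy per fold for (features, label) pairs."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits roughly equal folds.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_part = [data[i] for i in indices if i not in held]
        test_part = [data[i] for i in held_out]
        model = train_fn(train_part)
        scores.append(accuracy_fn(model, test_part))
    return scores


# Toy stand-ins for a classifier: always predict the most common training label.
def train_majority(rows):
    return Counter(label for _, label in rows).most_common(1)[0][0]


def majority_accuracy(model, rows):
    return sum(label == model for _, label in rows) / len(rows)


toy_data = [({}, "M")] * 8 + [({}, "F")] * 2  # made-up labels
scores = cross_validate(toy_data, train_majority, majority_accuracy)
print("CV mean accuracy:", mean(scores))
```

With NLTK, the call would presumably look like `cross_validate(featuresets, nltk.NaiveBayesClassifier.train, nltk.classify.util.accuracy)`, since both functions take the same arguments as the toy helpers.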


In [811]:
#Linear SVM Classifier model

from nltk.classify import SklearnClassifier
from sklearn.svm import SVC
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(featuresets):
    train1 = [featuresets[i] for i in train_idx]
    test1 = [featuresets[i] for i in test_idx]
    classifier_sk = SklearnClassifier(SVC(kernel='linear', C=10, random_state=1), sparse=True).train(train1)       
    accu.append( nltk.classify.util.accuracy(classifier_sk, test1) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

accuracy: 0.8224
accuracy: 0.8232
accuracy: 0.8088
accuracy: 0.8024
accuracy: 0.8078462770216173
CV mean accuracy: 0.812929255404


In [961]:
#Maximum Entropy classification

k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(featuresets):
    train_1 = [featuresets[i] for i in train_idx]
    test_1 = [featuresets[i] for i in test_idx]
    classifier_me = nltk.classify.MaxentClassifier.train(train_1, trace=3, max_iter=1)       
    accu.append( nltk.classify.util.accuracy(classifier_me, test_1) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.809
         Final          -0.36459        0.809
accuracy: 0.8304
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.815
         Final          -0.35358        0.815
accuracy: 0.8032
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.816
         Final          -0.35122        0.816
accuracy: 0.7992
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.815
         Final          -0.35422        0.815
accuracy: 0.8064
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accu

In [842]:
#Getting the most important features from Naive Bayes classifier 

classifier_nb.show_most_informative_features(20)

Most Informative Features
            first_letter = 'q'                 F : M      =      3.6 : 1.0
                   count = 3                   F : M      =      2.5 : 1.0
             last_letter = 'd'                 M : F      =      2.3 : 1.0
            first_letter = 'l'                 F : M      =      2.1 : 1.0
             last_letter = 'v'                 M : F      =      2.0 : 1.0
             last_letter = 'm'                 M : F      =      2.0 : 1.0
             last_letter = 'x'                 M : F      =      1.9 : 1.0
             last_letter = 'a'                 F : M      =      1.8 : 1.0
            first_letter = 'c'                 F : M      =      1.8 : 1.0
            first_letter = 'f'                 M : F      =      1.7 : 1.0
             last_letter = 'h'                 M : F      =      1.6 : 1.0
            first_letter = 'v'                 M : F      =      1.6 : 1.0
            first_letter = 'u'                 M : F      =      1.5 : 1.0

In [807]:
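# Note: train1 is the training split built in the SVM cross-validation loop, not the
# split classifier_nb was fit on; the two overlap, so this is not a held-out score.
nltk.classify.util.accuracy(classifier_nb, train1)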

0.8094

In [818]:
# Accuracy of the SVM on the training split train1 (not a held-out score)
nltk.classify.util.accuracy(classifier_sk, train1)

0.8142

In [962]:
# Accuracy of the Maximum Entropy model on train1, which overlaps its own training data
nltk.classify.util.accuracy(classifier_me, train1)

0.8142

In [966]:
pred_uname = []
for i in range(0,len(test)):
    res=(classifier_nb.classify(gender_features2(test['username'][i])))
    pred_uname.append(res)

In [967]:
# suppose your predictions are stored in a list named pred_uname
zz = pd.DataFrame({'username':test['username'], 'prediction':pred_uname})
zz.to_csv('pred_uname.csv', index=False)

## 2. Predicting Gender with Description
The updated notebook for lecture 11 might be of some help, which now includes demo code for making predictions with NLTK classifier.

In [789]:
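from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import string

ps = PorterStemmer()
# Build the stop-word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

def preprocess(text):
    '''Lower-case, tokenize, drop punctuation and stop words, then stem.'''
    return [ps.stem(w) for w in word_tokenize(text.lower())
            if w not in string.punctuation and w not in stop_words]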

In [819]:
def extract_features(words, selected_words):
    ''' Count how often each selected word occurs in words '''
    return nltk.FreqDist([w for w in words if w in selected_words])

In [820]:
train.head()

Unnamed: 0,username,gender,status,description
0,Vimal20011,M,"{u'payment_verified': False, u'identity_verifi...",A team of 5 working on various projects relate...
1,sheom,M,"{u'payment_verified': True, u'identity_verifie...",We are an IT solution and service provider com...
2,ezbik,M,"{u'payment_verified': False, u'identity_verifi...",System administration is my work & hobby.
3,angelme,F,"{u'payment_verified': False, u'identity_verifi...",Good day! Thank you for taking some time to ch...
4,snitch1,M,"{u'payment_verified': False, u'identity_verifi...",I build good relation with clients and deliver...


In [821]:
all_words=[]
for i in range(0,len(train)):
    result= preprocess(train.description[i])
    all_words.append(result)


In [822]:
import itertools
all_words_list= list(itertools.chain.from_iterable(all_words))

In [823]:
desc_words= list(zip(all_words,train.gender))

desc_df=pd.DataFrame(desc_words)
desc_df.head()

Unnamed: 0,0,1
0,"[team, 5, work, variou, project, relat, data, ...",M
1,"[solut, servic, provid, compani, expertis, e-l...",M
2,"[system, administr, work, hobbi]",M
3,"[good, day, thank, take, time, check, profil, ...",F
4,"[build, good, relat, client, deliv, high, qual...",M


In [875]:
desc_words[1]

(['solut',
  'servic',
  'provid',
  'compani',
  'expertis',
  'e-learn',
  'social',
  'media',
  'marketing.pleas',
  'visit',
  'compani',
  'url',
  'know'],
 'M')

In [824]:
words_freq = nltk.FreqDist(all_words_list)
words_freq['team']

798

In [825]:
selected_words = [word for word, freq in words_freq.items() if freq>1]
print('Before:',len(words_freq), ', after:', len(selected_words))

Before: 36128 , after: 9953
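
The vocabulary filter above keeps words that occur more than once. Since `selected_words` is a list, each `w in selected_words` test scans it linearly; a set makes that lookup constant-time on average. A small sketch with `collections.Counter` follows. The token lists are invented, and `bag_of_words` is a hypothetical stand-in for `extract_features`.

```python
from collections import Counter

# Hypothetical pre-processed descriptions (lists of stemmed tokens).
docs = [["web", "design", "web"], ["data", "entri"], ["web", "data"]]

# Corpus-wide counts, then keep only words seen more than once.
word_counts = Counter(w for doc in docs for w in doc)
vocabulary = {w for w, count in word_counts.items() if count > 1}


def bag_of_words(tokens, vocabulary):
    """Count the tokens that belong to the given vocabulary."""
    return dict(Counter(w for w in tokens if w in vocabulary))


print(sorted(vocabulary))
print(bag_of_words(docs[0], vocabulary))
```

The vocabulary built from the training descriptions would normally be reused unchanged when featurizing the test descriptions, so that both sets share the same feature space.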


In [879]:
selected_words[1]

'5'

In [826]:
feat = [(extract_features(words,selected_words), c) for words, c in desc_words]
feat[25]

(FreqDist({'hard': 1,
           'honest': 1,
           'laxmi': 1,
           'name': 2,
           'road': 1,
           'sharma': 1,
           'sumit': 1,
           'us': 1,
           'worker': 1}),
 'M')

### Cross-Validation Naive Bayes Classifier

In [952]:

from sklearn.model_selection import KFold
import numpy as np
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(feat):
    train_2 = [feat[i] for i in train_idx]
    test_2 = [feat[i] for i in test_idx]
    classifier2_nb = nltk.NaiveBayesClassifier.train(train_2)   
    accu.append( nltk.classify.util.accuracy(classifier2_nb, test_2) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu))    

accuracy: 0.5512
accuracy: 0.5528
accuracy: 0.596
accuracy: 0.548
accuracy: 0.5612489991993594
CV mean accuracy: 0.56184979984


### Classifiers from scikit-learn

In [843]:

from nltk.classify import SklearnClassifier
from sklearn.svm import SVC
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(feat):
    train_2 = [feat[i] for i in train_idx]
    test_2 = [feat[i] for i in test_idx]
    classifier2_sk = SklearnClassifier(SVC(kernel='linear', C=10, random_state=1), sparse=True).train(train_2)       
    accu.append( nltk.classify.util.accuracy(classifier2_sk, test_2) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

accuracy: 0.7432
accuracy: 0.7352
accuracy: 0.7352
accuracy: 0.7512
accuracy: 0.7349879903923139
CV mean accuracy: 0.739957598078


### Maximum Entropy Classifier

In [827]:
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(feat):
    train_2 = [feat[i] for i in train_idx]
    test_2 = [feat[i] for i in test_idx]
    classifier2_me = nltk.classify.MaxentClassifier.train(train_2, trace=3, max_iter=1)       
    accu.append( nltk.classify.util.accuracy(classifier2_me, test_2) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.816
         Final          -0.38001        0.808
accuracy: 0.7776
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
         Final          -0.37663        0.801
accuracy: 0.7984
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.811
         Final          -0.34920        0.806
accuracy: 0.8176
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
         Final          -0.35734        0.802
accuracy: 0.8032
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accu

### Gender Prediction with Description - Test Data


In [852]:
all_words_test=[]
for i in range(0,len(test)):
    result= preprocess(test.description[i])
    all_words_test.append(result)



In [853]:
import itertools
all_words_list_test= list(itertools.chain.from_iterable(all_words_test))

In [889]:
desc_words_test= list(all_words_test)

desc_words_test[1]

['sehrish',
 'warraich.i',
 'job',
 'sincerelli',
 'submit',
 'mission',
 'better',
 'better',
 'next',
 'time',
 'previou',
 'one.i',
 'experienc',
 'data',
 'entri',
 'job',
 'perfectly..']

In [890]:
desc_test_df=pd.DataFrame(desc_words_test)
desc_test_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1491,1492,1493,1494,1495,1496,1497,1498,1499,1500
0,one,self-employ,person,16,year,experi,comput,comput,program,softwar,...,,,,,,,,,,
1,sehrish,warraich.i,job,sincerelli,submit,mission,better,better,next,time,...,,,,,,,,,,
2,sinc,2006,web,base,solut,provid,web,engin,experienc,area,...,,,,,,,,,,
3,founder,geoloc,servic,develop,redi,django,restmq,node.j,,,...,,,,,,,,,,
4,jimmi,mean,honest,courag,origin,full,high,inspir,creativ,take,...,,,,,,,,,,


In [891]:

words_freq_test = nltk.FreqDist(all_words_list_test)

In [892]:
selected_words_test = [word for word, freq in words_freq_test.items() if freq>1]
print('Before:',len(words_freq_test), ', after:', len(selected_words_test))


selected_words_test[1]

Before: 20142 , after: 5820


'self-employ'

In [965]:
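# Note: this filters by the test-set vocabulary (selected_words_test); the classifiers
# were trained on features built from the training vocabulary (selected_words).
feat_test = [extract_features(words1, selected_words_test) for words1 in desc_words_test]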
feat_test[1]


FreqDist({'better': 2,
          'data': 1,
          'entri': 1,
          'experienc': 1,
          'job': 2,
          'mission': 1,
          'next': 1,
          'one.i': 1,
          'previou': 1,
          'submit': 1,
          'time': 1})

In [894]:
pred1 = [classifier2_sk.classify(row) for row in feat_test]


In [898]:
pred2=[classifier2_me.classify(row) for row in feat_test]

In [900]:
zz1 = pd.DataFrame({'username':test['username'], 'prediction':pred1})
zz1.to_csv('pred1_uname.csv', index=False)

In [901]:
zz2 = pd.DataFrame({'username':test['username'], 'prediction':pred2})
zz2.to_csv('pred2_uname.csv', index=False)

## 3. Predicting Gender with Username, Description, and Status
If you need to merge multiple dict-format features into one, check the following question: https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression
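
One way to merge several dict-format feature sets is sketched below. `merge_features` and the `user_` / `status_` / `word_` prefixes are assumptions made for illustration; the prefixes keep a description word such as `first_letter` from overwriting a username feature with the same key.

```python
def merge_features(username_feats, status_feats, word_counts):
    """Merge three feature dicts, prefixing keys so they cannot collide."""
    merged = {}
    merged.update({f"user_{k}": v for k, v in username_feats.items()})
    merged.update({f"status_{k}": v for k, v in status_feats.items()})
    merged.update({f"word_{k}": v for k, v in word_counts.items()})
    return merged


# Made-up inputs; note the deliberate key clash on "first_letter".
user = {"first_letter": "e"}
status = {"email_verified": True}
words = {"work": 1, "first_letter": 3}

merged = merge_features(user, status, words)
plain = {**user, **status, **words}  # later dicts silently win on clashes

print(merged)
print(plain["first_letter"])
```

A plain `{**a, **b}` merge is shorter and works well when the key sets are known to be disjoint.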

In [882]:
train = pd.read_csv('train.csv')

# Parse Json format status as dictionary
from ast import literal_eval
#status = train['status'].apply(literal_eval)

In [883]:
train.head()

Unnamed: 0,username,gender,status,description
0,Vimal20011,M,"{u'payment_verified': False, u'identity_verifi...",A team of 5 working on various projects relate...
1,sheom,M,"{u'payment_verified': True, u'identity_verifie...",We are an IT solution and service provider com...
2,ezbik,M,"{u'payment_verified': False, u'identity_verifi...",System administration is my work & hobby.
3,angelme,F,"{u'payment_verified': False, u'identity_verifi...",Good day! Thank you for taking some time to ch...
4,snitch1,M,"{u'payment_verified': False, u'identity_verifi...",I build good relation with clients and deliver...


In [884]:
train.status = train['status'].apply(literal_eval)

train.head()

Unnamed: 0,username,gender,status,description
0,Vimal20011,M,"{'payment_verified': False, 'identity_verified...",A team of 5 working on various projects relate...
1,sheom,M,"{'payment_verified': True, 'identity_verified'...",We are an IT solution and service provider com...
2,ezbik,M,"{'payment_verified': False, 'identity_verified...",System administration is my work & hobby.
3,angelme,F,"{'payment_verified': False, 'identity_verified...",Good day! Thank you for taking some time to ch...
4,snitch1,M,"{'payment_verified': False, 'identity_verified...",I build good relation with clients and deliver...


In [885]:
train=pd.concat([train.drop(['status'], axis=1), train['status'].apply(pd.Series)], axis=1)


In [886]:
train.head()

Unnamed: 0,username,gender,description,deposit_made,email_verified,facebook_connected,identity_verified,payment_verified,phone_verified,profile_complete
0,Vimal20011,M,A team of 5 working on various projects relate...,True,True,False,False,False,False,True
1,sheom,M,We are an IT solution and service provider com...,True,True,True,False,True,False,True
2,ezbik,M,System administration is my work & hobby.,False,True,False,False,False,False,True
3,angelme,F,Good day! Thank you for taking some time to ch...,True,True,False,False,False,True,True
4,snitch1,M,I build good relation with clients and deliver...,False,True,False,False,False,False,True
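
The parsing step above can be illustrated without pandas. The sketch below uses two invented status strings in the same `u'...'` style as the raw column; `raw_status`, `parsed`, and `table` are hypothetical names.

```python
from ast import literal_eval

# Hypothetical raw cells in the style of the status column.
raw_status = [
    "{u'payment_verified': False, u'email_verified': True}",
    "{u'payment_verified': True, u'email_verified': True}",
]

# literal_eval only evaluates Python literals, so it is safer than eval().
parsed = [literal_eval(cell) for cell in raw_status]

# Collect every key seen, then build one list per column (None fills any gaps).
columns = sorted({key for row in parsed for key in row})
table = {col: [row.get(col) for row in parsed] for col in columns}

print(columns)
print(table)
```

In pandas, building a frame directly from the parsed dictionaries, for example `pd.DataFrame(train['status'].tolist())`, is a commonly used alternative to `.apply(pd.Series)` and tends to be faster on large frames.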


In [887]:
all_words=[]
for i in range(0,len(train)):
    result= preprocess(train.description[i])
    all_words.append(result)


In [904]:
import itertools
all_words_list= list(itertools.chain.from_iterable(all_words))

In [905]:
all_words_list[0]

'team'

In [906]:
gender_f=[]
for i in range(0,len(train)):
    a=gender_features2(train.username[i])
    gender_f.append(a)





In [907]:
gender_df=pd.DataFrame(gender_f)

In [908]:
username_count=gender_df['count']
username_endswithvowel=gender_df.endswithvowel
username_first_letter=gender_df.first_letter
username_hasupper=gender_df.hasupper
username_ifalphanumeric=gender_df.ifalphanumeric
username_last_letter=gender_df.last_letter

In [910]:
desc_words= list(zip(all_words,username_count,username_endswithvowel,username_first_letter,username_hasupper,username_ifalphanumeric,username_last_letter,train.deposit_made,train.email_verified,train.facebook_connected,
train.identity_verified,train.payment_verified,train.phone_verified,train.profile_complete,train.gender))

In [911]:
desc_df=pd.DataFrame(desc_words)
desc_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,"[team, 5, work, variou, project, relat, data, ...",10,False,v,False,True,1,True,True,False,False,False,False,True,M
1,"[solut, servic, provid, compani, expertis, e-l...",5,False,s,False,True,m,True,True,True,False,True,False,True,M
2,"[system, administr, work, hobbi]",5,False,e,False,True,k,False,True,False,False,False,False,True,M
3,"[good, day, thank, take, time, check, profil, ...",7,True,a,False,True,e,True,True,False,False,False,True,True,F
4,"[build, good, relat, client, deliv, high, qual...",7,False,s,False,True,1,False,True,False,False,False,False,True,M


In [912]:
words_freq = nltk.FreqDist(all_words_list)

In [913]:
selected_words = [word for word, freq in words_freq.items() if freq>1]
print('Before:',len(words_freq), ', after:', len(selected_words))

Before: 36128 , after: 9953


In [914]:
def extract_features_all(words, selected_words, a,b,c,d,e,f,g,h,i,j,k,l,m):
    ''' Combine username features (a-f), status flags (g-m) and word counts into one dict '''
    feature = {}
    # Username features
    feature['username_count'] = a
    feature['endswithvowel'] = b
    feature['first_letter'] = c
    feature['hasupper'] = d
    feature['ifalphanumeric'] = e
    feature['last_letter'] = f
    # Status flags
    feature['deposit_made'] = g
    feature['email_verified'] = h
    feature['facebook_connected'] = i
    feature['identity_verified'] = j
    feature['payment_verified'] = k
    feature['phone_verified'] = l
    feature['profile_complete'] = m
    # Counts of the selected words found in the description
    feature.update(nltk.FreqDist([w for w in words if w in selected_words]))
    return feature



In [916]:
feat_all = [(extract_features_all(words,selected_words, a,b,c,d,e,f,g,h,i,j,k,l,m),gender) for (words,a,b,c,d,e,f,g,h,i,j,k,l,m,gender) in desc_words]


In [917]:
feat_all[2]


({'administr': 1,
  'deposit_made': False,
  'email_verified': True,
  'endswithvowel': False,
  'facebook_connected': False,
  'first_letter': 'e',
  'hasupper': False,
  'hobbi': 1,
  'identity_verified': False,
  'ifalphanumeric': True,
  'last_letter': 'k',
  'payment_verified': False,
  'phone_verified': False,
  'profile_complete': True,
  'system': 1,
  'username_count': 5,
  'work': 1},
 'M')

In [950]:
from sklearn.model_selection import KFold
import numpy as np
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(feat_all):
    train_3= [feat_all[i] for i in train_idx]
    test_3 = [feat_all[i] for i in test_idx]
    classifier3_nb = nltk.NaiveBayesClassifier.train(train_3)   
    accu.append( nltk.classify.util.accuracy(classifier3_nb, test_3) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu))    

accuracy: 0.5616
accuracy: 0.5912
accuracy: 0.588
accuracy: 0.5608
accuracy: 0.5708566853482786
CV mean accuracy: 0.57449133707


In [959]:
from nltk.classify import SklearnClassifier
from sklearn.svm import SVC
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(feat_all):
    train_3 = [feat_all[i] for i in train_idx]
    test_3 = [feat_all[i] for i in test_idx]
    classifier3_sk = SklearnClassifier(SVC(kernel='linear', C=10, random_state=1), sparse=True).train(train_3)       
    accu.append( nltk.classify.util.accuracy(classifier3_sk, test_3) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

accuracy: 0.7312
accuracy: 0.7168
accuracy: 0.7328
accuracy: 0.736
accuracy: 0.7493995196156925
CV mean accuracy: 0.733239903923


In [955]:
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(feat_all):
    train_3 = [feat_all[i] for i in train_idx]
    test_3 = [feat_all[i] for i in test_idx]
    classifier3_me = nltk.classify.MaxentClassifier.train(train_3, trace=3, max_iter=1)       
    accu.append( nltk.classify.util.accuracy(classifier3_me, test_3) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.811
      Training stopped: keyboard interrupt
         Final          -0.69315        0.811
accuracy: 0.8192
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.815
         Final          -0.29113        0.816
accuracy: 0.804
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.816
         Final          -0.24521        0.816
accuracy: 0.8024
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.814
         Final          -0.29265        0.815
accuracy: 0.8088
  ==> Training (1 iterations)


### Gender Prediction with Username, Status and Description - Test Data


In [931]:
test = pd.read_csv('test.csv')

from ast import literal_eval
#status = test['status'].apply(literal_eval)


In [932]:
test.status = test['status'].apply(literal_eval)
test.head()



Unnamed: 0,username,status,description
0,nazrulmadina,"{'payment_verified': False, 'identity_verified...",I am one of Self-employed person having more t...
1,SehrishWarraich,"{'payment_verified': False, 'identity_verified...",i am sehrish warraich.I do my job sincerelly a...
2,samadhinie,"{'payment_verified': False, 'identity_verified...","Since 2006, Web based solution provider (Web e..."
3,ebottabi,"{'payment_verified': False, 'identity_verified...",Founder of a geolocation service developed on...
4,mrjimoy,"{'payment_verified': False, 'identity_verified...","Me? I am Jimmy, meaning honest, courageous, or..."


In [933]:
test=pd.concat([test.drop(['status'], axis=1), test['status'].apply(pd.Series)], axis=1)

test.head()


Unnamed: 0,username,description,deposit_made,email_verified,facebook_connected,identity_verified,payment_verified,phone_verified,profile_complete
0,nazrulmadina,I am one of Self-employed person having more t...,False,True,False,False,False,False,True
1,SehrishWarraich,i am sehrish warraich.I do my job sincerelly a...,False,True,True,False,False,False,True
2,samadhinie,"Since 2006, Web based solution provider (Web e...",False,True,True,False,False,False,True
3,ebottabi,Founder of a geolocation service developed on...,False,True,False,False,False,False,True
4,mrjimoy,"Me? I am Jimmy, meaning honest, courageous, or...",False,True,True,False,False,True,True


In [934]:
all_words=[]
for i in range(0,len(test)):
    result= preprocess(test.description[i])
    all_words.append(result)

In [935]:
import itertools
all_words_list= list(itertools.chain.from_iterable(all_words))

In [936]:
all_words_list[0]

'one'

In [937]:
gender_f_test=[]
for i in range(0,len(test)):
    a=gender_features2(test.username[i])
    gender_f_test.append(a)

In [938]:
gender_df_test=pd.DataFrame(gender_f_test)

In [939]:
usernamet_count=gender_df_test['count']
usernamet_endswithvowel=gender_df_test.endswithvowel
usernamet_first_letter=gender_df_test.first_letter
usernamet_hasupper=gender_df_test.hasupper
usernamet_ifalphanumeric=gender_df_test.ifalphanumeric
usernamet_last_letter=gender_df_test.last_letter


In [940]:
desct_words= list(zip(all_words,usernamet_count,usernamet_endswithvowel,usernamet_first_letter,usernamet_hasupper,usernamet_ifalphanumeric,usernamet_last_letter,test.deposit_made,test.email_verified,test.facebook_connected,
test.identity_verified,test.payment_verified,test.phone_verified,test.profile_complete))

In [941]:
desc_df_t=pd.DataFrame(desct_words)
desc_df_t.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,"[one, self-employ, person, 16, year, experi, c...",12,True,n,False,True,a,False,True,False,False,False,False,True
1,"[sehrish, warraich.i, job, sincerelli, submit,...",15,False,s,False,True,h,False,True,True,False,False,False,True
2,"[sinc, 2006, web, base, solut, provid, web, en...",10,True,s,False,True,e,False,True,True,False,False,False,True
3,"[founder, geoloc, servic, develop, redi, djang...",8,True,e,False,True,i,False,True,False,False,False,False,True
4,"[jimmi, mean, honest, courag, origin, full, hi...",7,False,m,False,True,y,False,True,True,False,False,True,True


In [942]:
words_freqt = nltk.FreqDist(all_words_list)

In [943]:

selected_wordst = [word for word, freq in words_freqt.items() if freq>1]
print('Before:',len(words_freqt), ', after:', len(selected_wordst))


Before: 20142 , after: 5820


In [944]:
feat_all_test = [(extract_features_all(words,selected_words, a,b,c,d,e,f,g,h,i,j,k,l,m)) for (words,a,b,c,d,e,f,g,h,i,j,k,l,m) in desct_words]


In [945]:
feat_all_test[1]

{'better': 2,
 'data': 1,
 'deposit_made': False,
 'email_verified': True,
 'endswithvowel': False,
 'entri': 1,
 'experienc': 1,
 'facebook_connected': True,
 'first_letter': 's',
 'hasupper': False,
 'identity_verified': False,
 'ifalphanumeric': True,
 'job': 2,
 'last_letter': 'h',
 'mission': 1,
 'next': 1,
 'payment_verified': False,
 'phone_verified': False,
 'previou': 1,
 'profile_complete': True,
 'submit': 1,
 'time': 1,
 'username_count': 15}

In [958]:
pred3 = [classifier3_sk.classify(row) for row in feat_all_test]


In [960]:
zz3 = pd.DataFrame({'username':test['username'], 'prediction':pred3})
zz3.to_csv('pred3_uname.csv', index=False)

In [948]:
pred4=[classifier3_me.classify(row) for row in feat_all_test]


In [949]:
zz4 = pd.DataFrame({'username':test['username'], 'prediction':pred4})
zz4.to_csv('pred4_uname.csv', index=False)

### Extra Credit: Try Different Features and Models for Best Performance
Save your predictions as netid_1.csv, ..., netid_5.csv
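
A small sketch of writing the numbered submission files is shown below. `submission_name` and `write_predictions` are hypothetical helpers, the `username,prediction` column order is an assumption (the layout of sample_submission.csv is not shown here), and the two predictions are placeholders rather than model output.

```python
import csv
import io


def submission_name(netid, k):
    """File name for the k-th submission, e.g. netid_1.csv."""
    return f"{netid}_{k}.csv"


def write_predictions(handle, usernames, predictions):
    """Write a header row plus one (username, prediction) row per user."""
    if len(usernames) != len(predictions):
        raise ValueError("usernames and predictions must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["username", "prediction"])
    writer.writerows(zip(usernames, predictions))


# Demo on an in-memory buffer; the labels here are placeholders.
buffer = io.StringIO()
write_predictions(buffer, ["nazrulmadina", "mrjimoy"], ["M", "M"])
csv_text = buffer.getvalue()
print(submission_name("ktv16001", 1))
print(csv_text)
```

To write a real file, the buffer could be swapped for `open(submission_name(netid, k), "w", newline="")`.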

In [None]:
In the first question, I extracted features from the username: whether it is alphanumeric, whether it ends with a vowel, whether it is upper case, its length, and its first and last letters.

The most informative features for the Naive Bayes model were the first letter, the last letter, and the character count; the model gave an accuracy of 0.8104.

In the second question, I built features from the description by tokenizing and stemming it into a list of words, then counting word occurrences to train the models.

The Maximum Entropy and linear Support Vector Machine models gave accuracies of 0.800 and 0.7399, respectively.

In the third question, I combined the status features with the description and username features; the Maximum Entropy and linear Support Vector Machine models gave accuracies of 0.833 and 0.733, respectively.




