# Assignment 5. Machine Learning and Natural Language Processing

OPIM 5894 Data Science with Python

Name:Deepti Anie Varghese   NetID:dav16108

Discussed with: if any

## Instructions
In this assignment, you are asked to predict genders of users using their public information on websites. In question 1, you are asked to predict gender using only usename. In question 2, you are asked to predict gender using the profile description of a user instead. Finally, you may combine all available information of users to make predictions. You may explore different models and different combination of features, as well as different ways to transform features, to achieve best performance. 
<br> <br>
- It is recommended to use NLTK for this classification task, as the features stored in dictionary style can be easily extended. While scikit-learn is easier for Q2, it might not be that straightforward to combine different features in Q3. In addition, dealing with categorical variables can be a pain in scikit-learn. If you plan to use scikit-learn anyway, please read the following post: http://pbpython.com/categorical-encoding.html
- While protyping, it is easier to stick to the Naive Bayes Classifier. Adding other classifiers once your code is bug-free.
- Use cross validation on the training set to avoid over-fitting, though it is not guaranteed achieve that purpose.


<br>
This assignment involves the following challenges:
- Construct features from strings (i.e., usernames)
- Frequent use of zip() and zip(*) (see doc https://docs.python.org/3/library/functions.html)
- Parsing a json style column into multiple columns
- Merging different features into one feature set
- Find appropriate models and features to improve prediction accuracy
- Writing and debugging a lot of code
<br><br>

What to submit?
- The predictions of 5 models on the test set (see a sample submission sample_submission.csv). Diverify your portfolio, as similar models may suffer from similar problems.
- The notebook file (** please make sure that your code are sufficiently commented**)
- In the end of the notebook file, briefly describe what you have done, which models work the best, and what findings you have.
<br><br>

The top 50% submissions will get 0-3 extra points. Try at least 3 models for each question. Try as many as you want for extra credit.
<br><br>
** Please do NOT distribute the dataset used in this assignment!**


In [96]:
import nltk
import pandas as pd
import os
os.chdir('C:/Users/deept/Desktop/Fall 2017/DataScience with Python/Nov16/Assignmnent5')

In [97]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')




## 1. Predicting Gender with Username
Some potential features of usernames: whether it has capital letters, whether it has digits, number of characters, number of vowels, first and last letters, etc. See http://www.nltk.org/book/ch06.html for some related code.

In [98]:
# Function to extract various features from name like presence of capital letters, digits, vowels, length of names, first and 
# last characters from names using Regular Expression

import re
def usernamefeat(names):
    features={'Caps':re.search('[A-Z]',names),                # Checking presence of capital letters using regular expression
              'Digit':re.search('\d',names),                  # Checking presence of digits using regular expression
              'Numchar':len(names),                           # Calculating length of names
              'Vowel':re.search('[aeiou]', names,flags=re.I), # Checking presence of vowels using regular expression
              'First':names[0],                               # Extracting first character of name
               'Last':names[-1]}                              # Extracting last character of name
    return features


In [99]:
username1=train['username']
gender1=train['gender']
usergender=pd.concat([username1,gender1],axis=1)


In [100]:
# Create a list of tuples containing username and gender - this is the training set

trainfeat=[(usernamefeat(username),gender) for username,gender in usergender.values]
trainfeat

[({'Caps': <_sre.SRE_Match object; span=(0, 1), match='V'>,
   'Digit': <_sre.SRE_Match object; span=(5, 6), match='2'>,
   'First': 'V',
   'Last': '1',
   'Numchar': 10,
   'Vowel': <_sre.SRE_Match object; span=(1, 2), match='i'>},
  'M'),
 ({'Caps': None,
   'Digit': None,
   'First': 's',
   'Last': 'm',
   'Numchar': 5,
   'Vowel': <_sre.SRE_Match object; span=(2, 3), match='e'>},
  'M'),
 ({'Caps': None,
   'Digit': None,
   'First': 'e',
   'Last': 'k',
   'Numchar': 5,
   'Vowel': <_sre.SRE_Match object; span=(0, 1), match='e'>},
  'M'),
 ({'Caps': None,
   'Digit': None,
   'First': 'a',
   'Last': 'e',
   'Numchar': 7,
   'Vowel': <_sre.SRE_Match object; span=(0, 1), match='a'>},
  'F'),
 ({'Caps': None,
   'Digit': <_sre.SRE_Match object; span=(6, 7), match='1'>,
   'First': 's',
   'Last': '1',
   'Numchar': 7,
   'Vowel': <_sre.SRE_Match object; span=(2, 3), match='i'>},
  'M'),
 ({'Caps': None,
   'Digit': None,
   'First': 'e',
   'Last': 't',
   'Numchar': 14,
   'Vowel

In [101]:
# Run a Naive Bayes classifier on above training set using 5 fold cross validation and check accuracy of model

from sklearn.model_selection import KFold
import numpy as np
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(trainfeat):
    train = [trainfeat[i] for i in train_idx]
    test = [trainfeat[i] for i in test_idx]
    classifier = nltk.NaiveBayesClassifier.train(train)   
    accu.append( nltk.classify.util.accuracy(classifier, test) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu))   

accuracy: 0.7168
accuracy: 0.6952
accuracy: 0.7208
accuracy: 0.7168
accuracy: 0.722177742193755
CV mean accuracy: 0.714355548439


In [102]:
# Display top 5 most informative features used for classification

classifier.show_most_informative_features(5)

Most Informative Features
                   First = 'U'                 F : M      =      7.2 : 1.0
                    Last = 'O'                 F : M      =      7.2 : 1.0
                   First = 'X'                 F : M      =      4.3 : 1.0
                   First = 'Y'                 F : M      =      3.5 : 1.0
                    Last = 'B'                 F : M      =      3.5 : 1.0


In [103]:
#Train Naive Bayes classifier on entire training set and use it to predict gender on test set

test = pd.read_csv('test.csv')
username1=test['username']
testfeat=[usernamefeat(username) for username in username1]   # Creation of test set
classifier2 = nltk.NaiveBayesClassifier.train(trainfeat)  
pred = [classifier2.classify(row) for row in testfeat]        # Predicting output of test set 
pred

['M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M'

In [104]:
# Copy test set predictions to output dataset

naiveqn1 = pd.DataFrame({'username':test['username'], 'prediction':pred})
naiveqn1.to_csv('dav16108naiveqn1.csv', index=False)

In [105]:
# Run a Max Entropy classifier on above training set using 5 fold cross validation and check accuracy of model


k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(trainfeat):
    train = [trainfeat[i] for i in train_idx]
    test = [trainfeat[i] for i in test_idx]
    classifier = nltk.classify.MaxentClassifier.train(trainfeat, trace=3, max_iter=5)       
    accu.append( nltk.classify.util.accuracy(classifier, test) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

  ==> Training (5 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
             2          -0.33608        0.813
             3          -0.30253        0.819
             4          -0.27509        0.836
         Final          -0.25208        0.868
accuracy: 0.868
  ==> Training (5 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
             2          -0.33608        0.813
             3          -0.30253        0.819
             4          -0.27509        0.836
         Final          -0.25208        0.868
accuracy: 0.8632
  ==> Training (5 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
             2          -0.33608        0.813
             3          -0.30253        0.819


In [106]:
# Display top 5 most informative features used for classification

classifier.show_most_informative_features(5)

   1.468 Vowel==<_sre.SRE_Match object; span=(2, 3), match='a'> and label is 'F'
   1.401 Vowel==<_sre.SRE_Match object; span=(0, 1), match='i'> and label is 'F'
   1.398 Vowel==<_sre.SRE_Match object; span=(1, 2), match='o'> and label is 'F'
   1.390 Vowel==<_sre.SRE_Match object; span=(1, 2), match='e'> and label is 'F'
   1.387 Vowel==<_sre.SRE_Match object; span=(0, 1), match='u'> and label is 'F'


In [107]:
#Train Max Entropy classifier on entire training set and use it to predict gender on test set


test3 = pd.read_csv('test.csv')
username3=test3['username']
testfeat=[usernamefeat(username) for username in username3]
classifier3 = nltk.classify.MaxentClassifier.train(trainfeat, trace=3, max_iter=5)  
pred3 = [classifier3.classify(row) for row in testfeat]
pred3

  ==> Training (5 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
             2          -0.33608        0.813
             3          -0.30253        0.819
             4          -0.27509        0.836
         Final          -0.25208        0.868


['M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M'

In [108]:
# Copy test set predictions to output dataset

maxentropyqn1 = pd.DataFrame({'username':test3['username'], 'prediction':pred3})
maxentropyqn1.to_csv('dav16108maxentropyqn1.csv', index=False)

In [109]:
# support your predictions are stored in a list named pred_uname
# zz = pd.DataFrame({'username':test['username'], 'prediction':pred_uname})
# zz.to_csv('pred_uname.csv', index=False)

## 2. Predicting Gender with Description
The updated notebook for lecture 11 might be of some help, which now includes demo code for making predictions with NLTK classifier.

In [110]:
#Sentence preprocessing

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
ps = PorterStemmer()
from nltk.tokenize import word_tokenize
def preprocess(text):
    return [ps.stem(w) for w in word_tokenize(text.lower()) 
             if w not in string.punctuation and w not in stopwords.words('english')] 

In [111]:
#Extracting word counts from sentences
def extract_features(words, selected_words):
    ''' simply using words counts'''
    return nltk.FreqDist([w for w in words if w in selected_words])

In [112]:
# Read train and test data sets

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [113]:
# Extract gender and description and from training set

gender1=train['gender']
desc1=train['description']
userdesc1=pd.concat([gender1,desc1],axis=1)

In [114]:
# Create a list of tuples containing relevant words from description and gender - this is the training set

all_words = [(gender,preprocess(description)) for gender,description in userdesc1.values ]
all_words2 = [w for gender, description in all_words for w in description]  #Ungroup or flatten and convert into 1 set of words
words_freq = nltk.FreqDist(all_words2)
selected_words = [word for word, freq in words_freq.items() if freq>1] #Word count greater than 1 in whole set of words
trainfeat2=[(extract_features(desc,selected_words),gender) for gender,desc in all_words]# retain only those words in each group where word count>!
trainfeat2

[(FreqDist({'5': 1,
            'content': 1,
            'data': 1,
            'entri': 1,
            'project': 1,
            'relat': 1,
            'research': 1,
            'team': 1,
            'variou': 1,
            'work': 2,
            'write': 1}),
  'M'),
 (FreqDist({'compani': 2,
            'e-learn': 1,
            'expertis': 1,
            'know': 1,
            'media': 1,
            'provid': 1,
            'servic': 1,
            'social': 1,
            'solut': 1,
            'url': 1,
            'visit': 1}),
  'M'),
 (FreqDist({'administr': 1, 'hobbi': 1, 'system': 1, 'work': 1}), 'M'),
 (FreqDist({'acquir': 1,
            'articl': 1,
            'check': 1,
            'content': 1,
            'copywrit': 1,
            'day': 1,
            'experi': 2,
            'experienc': 1,
            'good': 1,
            'hope': 1,
            'knowledg': 1,
            'offer': 1,
            'profil': 1,
            'promis': 1,
            'qualiti': 

In [115]:
# Run a Naive Bayes classifier on above training set using 5 fold cross validation and check accuracy of model

from sklearn.model_selection import KFold
import numpy as np
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(trainfeat2):
    train = [trainfeat2[i] for i in train_idx]
    test = [trainfeat2[i] for i in test_idx]
    classifier = nltk.NaiveBayesClassifier.train(train)   
    accu.append( nltk.classify.util.accuracy(classifier, test) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu))   

accuracy: 0.5736
accuracy: 0.5504
accuracy: 0.54
accuracy: 0.5888
accuracy: 0.5724579663730984
CV mean accuracy: 0.565051593275


In [116]:
# Display top 5 most informative features used for classification

classifier.show_most_informative_features(5)

Most Informative Features
                   linux = 1                   M : F      =     17.3 : 1.0
                     mom = 1                   F : M      =     17.0 : 1.0
                     cat = 1                   F : M      =     16.2 : 1.0
                 typeset = 1                   F : M      =     16.2 : 1.0
                     cum = 1                   F : M      =     15.0 : 1.0


In [117]:
#Train Naive Bayes classifier on entire training set and use it to predict gender on test set
import pandas as pd

train = pd.read_csv('train.csv')
test4 = pd.read_csv('test.csv')
desc1=pd.DataFrame(test4['description'])
all_words = [preprocess(description) for description in desc1['description'] ]

all_words2 = [w for description in all_words for w in description]  #Ungroup or flatten and convert into 1 set of words
words_freq = nltk.FreqDist(all_words2)
selected_words = [word for word, freq in words_freq.items() if freq>1] #Word count greater than 1 in whole set of words
testfeat2=[extract_features(desc,selected_words) for desc in all_words]

classifier2 = nltk.NaiveBayesClassifier.train(trainfeat2)  
pred = [classifier2.classify(row) for row in testfeat2]
pred


['F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M'

In [118]:
# Copy test set predictions to output dataset

naiveqn2 = pd.DataFrame({'username':test4['username'], 'prediction':pred})
naiveqn2.to_csv('dav16108naiveqn2.csv', index=False)

In [119]:
# Run a Max Entropy classifier on above training set using 5 fold cross validation and check accuracy of model


k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(trainfeat2):
    train = [trainfeat2[i] for i in train_idx]
    test = [trainfeat2[i] for i in test_idx]
    classifier = nltk.classify.MaxentClassifier.train(train, trace=3, max_iter=2)       
    accu.append( nltk.classify.util.accuracy(classifier, test) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.810
         Final          -0.62553        0.568
accuracy: 0.5544
  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
         Final          -0.33292        0.807
accuracy: 0.8016
  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.814
         Final          -0.33808        0.809
accuracy: 0.796
  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.814
         Final          -0.35016        0.810
accuracy: 0.8056
  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accur

In [120]:
# Display top 5 most informative features used for classification

classifier.show_most_informative_features(5)

   0.500 lion==1 and label is 'M'
   0.500 exchangelyncwindow==1 and label is 'M'
   0.500 ciw==1 and label is 'M'
   0.433 build==1 and label is 'M'
   0.433 complet==1 and label is 'M'


In [121]:
#Train Max Entropy classifier on entire training set and use it to predict gender on test set


train = pd.read_csv('train.csv')
test5 = pd.read_csv('test.csv')
desc1=pd.DataFrame(test4['description'])
all_words = [preprocess(description) for description in desc1['description'] ]
all_words2 = [w for description in all_words for w in description]  #Ungroup or flatten and convert into 1 set of words
words_freq = nltk.FreqDist(all_words2)
selected_words = [word for word, freq in words_freq.items() if freq>1] #Word count greater than 1 in whole set of words
testfeat2=[extract_features(desc,selected_words) for desc in all_words]

classifier2 = nltk.classify.MaxentClassifier.train(trainfeat2, trace=3, max_iter=2)
pred = [classifier2.classify(row) for row in testfeat2]
pred

  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
         Final          -0.36920        0.804


['M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M'

In [122]:
# Copy test set predictions to output dataset

maxentropyqn2 = pd.DataFrame({'username':test5['username'], 'prediction':pred})
maxentropyqn2.to_csv('dav16108maxentropyqn2.csv', index=False)

In [123]:
# Run Support Vector classifier on above training set using 5 fold cross validation and check accuracy of model

from nltk.classify import SklearnClassifier
from sklearn.svm import SVC
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(trainfeat2):
    train = [trainfeat2[i] for i in train_idx]
    test = [trainfeat2[i] for i in test_idx]
    classifier = SklearnClassifier(SVC(kernel='linear', C=10, random_state=1), sparse=True).train(train)       
    accu.append( nltk.classify.util.accuracy(classifier, test) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

accuracy: 0.752
accuracy: 0.7704
accuracy: 0.7312
accuracy: 0.7336
accuracy: 0.734187349879904
CV mean accuracy: 0.744277469976


In [124]:
#Train Support Vector classifier on entire training set and use it to predict gender on test set


svcclf = SklearnClassifier(SVC(kernel='linear', C=10, random_state=1), sparse=True).train(trainfeat2)
pred = [svcclf.classify(row) for row in testfeat2]
pred

['M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M'

In [125]:
# Copy test set predictions to output dataset

svcqn2 = pd.DataFrame({'username':test5['username'], 'prediction':pred})
svcqn2.to_csv('dav16108svcqn2.csv', index=False)

## 3. Predicting Gender with Username, Description, and Status
If you need to merge multiple dict-format features into one, check the following question: https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression

In [126]:
# Parse Json format status as dictionary
train=pd.read_csv('train.csv')
from ast import literal_eval
status = train['status'].apply(literal_eval)

In [127]:
# Now you need to find a way to split the dictionary format status as multiple columns


In [128]:
#Splitting dictionary format status to multiple columns and displaying

len(status)
df=pd.DataFrame()
for i in range(len(status)):
    data=pd.DataFrame(status[i],index=[i])
    df=df.append(data,ignore_index=True)
df
train=pd.concat([train,df],axis=1)
train


Unnamed: 0,username,gender,status,description,deposit_made,email_verified,facebook_connected,identity_verified,payment_verified,phone_verified,profile_complete
0,Vimal20011,M,"{u'payment_verified': False, u'identity_verifi...",A team of 5 working on various projects relate...,True,True,False,False,False,False,True
1,sheom,M,"{u'payment_verified': True, u'identity_verifie...",We are an IT solution and service provider com...,True,True,True,False,True,False,True
2,ezbik,M,"{u'payment_verified': False, u'identity_verifi...",System administration is my work & hobby.,False,True,False,False,False,False,True
3,angelme,F,"{u'payment_verified': False, u'identity_verifi...",Good day! Thank you for taking some time to ch...,True,True,False,False,False,True,True
4,snitch1,M,"{u'payment_verified': False, u'identity_verifi...",I build good relation with clients and deliver...,False,True,False,False,False,False,True
5,ehabdigitalart,M,"{u'payment_verified': False, u'identity_verifi...","Over the last 12 years, I have developed a wid...",False,True,True,False,False,False,True
6,laarniandbuboy,F,"{u'payment_verified': False, u'identity_verifi...",WORK EXPERIENCESTraining SupervisorDirect Resp...,False,True,False,False,False,False,True
7,payzone,M,"{u'payment_verified': False, u'identity_verifi...",i'm graduate from engineering chemical faculty...,False,True,False,False,False,False,True
8,istratebogdan,M,"{u'payment_verified': False, u'identity_verifi...",Name: Istrate Bogdan PetrusNationality: Romani...,False,True,True,False,False,False,True
9,pam2489,F,"{u'payment_verified': False, u'identity_verifi...",My goal is to create long-lasting relationship...,True,True,False,False,False,False,True


In [129]:
#Combining features username,description and status

fea={}
for i in range(len(status)):
    z={**trainfeat[i][0],**trainfeat2[i][0],**status[i]}
    fea[i]=z

fea


{0: {'5': 1,
  'Caps': <_sre.SRE_Match object; span=(0, 1), match='V'>,
  'Digit': <_sre.SRE_Match object; span=(5, 6), match='2'>,
  'First': 'V',
  'Last': '1',
  'Numchar': 10,
  'Vowel': <_sre.SRE_Match object; span=(1, 2), match='i'>,
  'content': 1,
  'data': 1,
  'deposit_made': True,
  'email_verified': True,
  'entri': 1,
  'facebook_connected': False,
  'identity_verified': False,
  'payment_verified': False,
  'phone_verified': False,
  'profile_complete': True,
  'project': 1,
  'relat': 1,
  'research': 1,
  'team': 1,
  'variou': 1,
  'work': 2,
  'write': 1},
 1: {'Caps': None,
  'Digit': None,
  'First': 's',
  'Last': 'm',
  'Numchar': 5,
  'Vowel': <_sre.SRE_Match object; span=(2, 3), match='e'>,
  'compani': 2,
  'deposit_made': True,
  'e-learn': 1,
  'email_verified': True,
  'expertis': 1,
  'facebook_connected': True,
  'identity_verified': False,
  'know': 1,
  'media': 1,
  'payment_verified': True,
  'phone_verified': False,
  'profile_complete': True,
  'prov

In [130]:
#Making training set in the form of tuple of features,gender-->list2 is training set
gender1=train['gender']
list2=[]
for i in range(len(status)):
    z2=(fea[i],gender1[i])
    list2.append(z2)

list2

[({'5': 1,
   'Caps': <_sre.SRE_Match object; span=(0, 1), match='V'>,
   'Digit': <_sre.SRE_Match object; span=(5, 6), match='2'>,
   'First': 'V',
   'Last': '1',
   'Numchar': 10,
   'Vowel': <_sre.SRE_Match object; span=(1, 2), match='i'>,
   'content': 1,
   'data': 1,
   'deposit_made': True,
   'email_verified': True,
   'entri': 1,
   'facebook_connected': False,
   'identity_verified': False,
   'payment_verified': False,
   'phone_verified': False,
   'profile_complete': True,
   'project': 1,
   'relat': 1,
   'research': 1,
   'team': 1,
   'variou': 1,
   'work': 2,
   'write': 1},
  'M'),
 ({'Caps': None,
   'Digit': None,
   'First': 's',
   'Last': 'm',
   'Numchar': 5,
   'Vowel': <_sre.SRE_Match object; span=(2, 3), match='e'>,
   'compani': 2,
   'deposit_made': True,
   'e-learn': 1,
   'email_verified': True,
   'expertis': 1,
   'facebook_connected': True,
   'identity_verified': False,
   'know': 1,
   'media': 1,
   'payment_verified': True,
   'phone_verified':

In [131]:
# Run a Naive Bayes classifier on above training set using 5 fold cross validation and check accuracy of model

from sklearn.model_selection import KFold
import numpy as np
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(list2):
    train = [list2[i] for i in train_idx]
    test = [list2[i] for i in test_idx]
    classifier = nltk.NaiveBayesClassifier.train(train)   
    accu.append( nltk.classify.util.accuracy(classifier, test) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu))   



accuracy: 0.5336
accuracy: 0.5104
accuracy: 0.5616
accuracy: 0.5448
accuracy: 0.544435548438751
CV mean accuracy: 0.538967109688


In [132]:
#Train Naive Bayes classifier on entire training set and use it to predict gender on test set


test = pd.read_csv('test.csv')

from ast import literal_eval                          #Parse Json format status as dictionary on test set

status2 = test['status'].apply(literal_eval)

fea2=[]                                             # Creating test set of features
for i in range(len(status2)):
    z={**testfeat[i],**testfeat2[i],**status2[i]}
    fea2.append(z)

fea2
classifier2 = nltk.NaiveBayesClassifier.train(list2)  #Training model on training set list2
pred100 = [classifier2.classify(row) for row in fea2] #Gender prediction on test set fea2
pred100

['F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'F',
 'F',
 'F',
 'F',
 'F',
 'F',
 'M',
 'M',
 'F',
 'M',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'F',
 'M',
 'F',
 'M',
 'M',
 'M',
 'F',
 'M'

In [133]:
# Display top 5 most informative features used for classification

classifier2.show_most_informative_features(5)

Most Informative Features
                   femal = 1                   F : M      =     21.7 : 1.0
                  well.i = 1                   F : M      =     18.8 : 1.0
                     cat = 1                   F : M      =     18.8 : 1.0
                     mom = 1                   F : M      =     16.8 : 1.0
                 tourist = 1                   F : M      =     15.9 : 1.0


In [134]:
# Copy test set predictions to output dataset

naiveqn3 = pd.DataFrame({'username':test['username'], 'prediction':pred100})
naiveqn3.to_csv('dav16108naiveqn3.csv', index=False)

In [135]:
# Run a Max Entropy classifier on above training set using 5 fold cross validation and check accuracy of model

k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(list2):
    train = [list2[i] for i in train_idx]
    test = [list2[i] for i in test_idx]
    classifier = nltk.classify.MaxentClassifier.train(train, trace=3, max_iter=2)       
    accu.append( nltk.classify.util.accuracy(classifier, test) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.812
         Final          -0.24406        0.813
accuracy: 0.816
  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.815
         Final          -0.28975        0.816
accuracy: 0.804
  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.811
         Final          -0.31599        0.812
accuracy: 0.8208
  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.814
         Final          -0.30684        0.815
accuracy: 0.8096
  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accura

In [136]:
#Train Max Entropy classifier on entire training set and use it to predict gender on test set

classifier3 = nltk.classify.MaxentClassifier.train(list2, trace=3, max_iter=2)
pred101 = [classifier3.classify(row) for row in fea2]
pred101

  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
         Final          -0.28704        0.814


['M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M',
 'M'

In [137]:
# Display top 5 most informative features used for classification


classifier3.show_most_informative_features(5)

   0.446 Numchar==5 and label is 'M'
   0.446 build==1 and label is 'M'
   0.446 complet==1 and label is 'M'
   0.446 follow==1 and label is 'M'
   0.446 express==1 and label is 'M'


In [138]:
# Copy test set predictions to output dataset

test = pd.read_csv('test.csv')
maxentropyqn3 = pd.DataFrame({'username':test['username'], 'prediction':pred101})
maxentropyqn3.to_csv('dav16108maxentropyqn3.csv', index=False)

## Description and insights of Question1,2 and 3

In question1, predicting gender from username, I extracted various features from username like first character, last character,
length of name,number of capitalized letters, vowel counts and presence of digits to train the model to predict gender. Two classifiers, Naive Bayes and Maximum Entropy classifier were used to train the model and predict gender on new data sets.
5-fold cross validation was done.Naive bayes classifier predicted new cases with mean accuracy of 71.4% and Max Entropy classifier predicted new cases with mean accuracy of 86.7%.  
First=U,X,Y,Last=O,B were major features to identify female users according to Naive bayes
Vowels=a,e,i,o,u were major features to identify female users according to max entropy classifier

In question2, predicting gender from description, the words and frequency of occurence in description was used to train the model to predict gender. Three classifiers, Naive Bayes, maximum Entropy and Support Vector Classifier were used to train the model and predict gender on test set.
5-fold cross validation was done.Naive bayes classifier predicted new cases with mean accuracy of 56.5% ,Max Entropy classifier predicted new cases with mean accuracy of 75.08%, and Support Vector Classifier predicted new cases with mean accuracy of 74.4%  
Presence of words mom, cat, typeset, cum were major features to identify female users according to Naive bayes
Presence of words lion,build,complete were major features to identify male users according to max entropy classifier


In question3, predicting gender from a combination of username, decsription and status, all the dictionary of features from earlier questions were combined with status dictionary obtained after json parsing. This combined dictionary of features was used to train the model and predict gender for test case.Two models, Naive Bayes and Maximum Entropy Classifier were used to train the model and predict gender on test set.
5-fold cross validation was done.Naive bayes classifier predicted new cases with mean accuracy of 53.8% ,Max Entropy classifier predicted new cases with mean accuracy of 81.2%.
Presence of words femal, cat, mom, tourist were major features to identify female users according to Naive bayes. Presence of words build, complet,follow,express and number of characters=5 were major features to identify male users according to max entropy classifier.