## Prediciting Demography on large datasets using NLP 

This notebook is dedicated to using NLP and machine learning on keywords related to website searches. We will use this to predict demgraphic groups in our data, more specifically on age and gender. The dataset being 1.5 GB large will require some transformations abd processing steps. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import collections

from sklearn.metrics import confusion_matrix
import os
import matplotlib.pyplot as plt
os.chdir('/Users/JohanLg/Documents/My Documents/ESCP/Kurser/Vår/Python')

In [2]:
data = pd.read_csv('train2.csv', sep=',')

In [3]:
df = data.sample(frac=0.01, replace=False, random_state=1)

# Large dataset, sample 1% for exploration

In [4]:
## Dropping rows with null values
Cleantr = df.dropna()
print(Cleantr.head())
print("\n")
print("Number of occurances %d8" % len(Cleantr))

              ID                                           keywords  age sex
244242    392037  des:1;protection:1;offre:1;risques:1;preventio...   39   M
4143485  1782381                                          annonce:3   45   F
5262074   433807  relation:1;apres:1;avec:1;consentante:1;sexuel...   48   F
1803744   181848               terrasse:1;auto:1;accident:1;ville:1   52   M
2110117   386562  livre:1;3eme:1;transmath:1;affich:1;forum:1;co...   49   F


Number of occurances 641878


Now that we have a clean dataset,the idea is to "untangle" the keywords so that we can apply standard NLP techniques like TF-IDF. We want to create a dictonary of all the words used in the keywords seaches. First step is to create the dictionary, where we will

1. Split the keyword column into word and count
2. Multiply the word by its count
3. Create a dictionary

Step 1. Splitting

In [5]:
#splitting keywords

def split_keywords_expand(x):
    result = ""
    if isinstance(x, str):
        for word_count in x.split(";"):
            if len(word_count.split(":")) == 2:
                word, count = word_count.split(":")
            else:
                word = word_count
                count = 1
            for _ in range(int(count)):
                result += " " + word
    return result


In [6]:
Cleantr['keywords']=Cleantr['keywords'].apply(split_keywords_expand)

In [7]:
Cleantr.head()

Unnamed: 0,ID,keywords,age,sex
244242,392037,des protection offre risques prevention charge,39,M
4143485,1782381,annonce annonce annonce,45,F
5262074,433807,relation apres avec consentante sexuelle actu...,48,F
1803744,181848,terrasse auto accident ville,52,M
2110117,386562,livre 3eme transmath affich forum corrige,49,F


Step 3. Create Dictionary 

We see that this dataframe shows all keyword in its multiplied form. Now we shall apply NLP by:

1. Removing Stem-Stop words.
2. Counting ALL occurances of words.
3. Using the counts or weighted counts in a model.

In [8]:
from nltk import stem
from nltk.corpus import stopwords
import nltk
nltk.download("stopwords")
stemmer = stem.SnowballStemmer('french')
stopwords = set(stopwords.words('french'))

def review_messages(msg):
    # converting messages to lowercase
    msg = msg.lower()
    # removing stopwords
    msg = [word for word in msg.split() if word not in stopwords]
    # using a stemmer
    msg = " ".join([stemmer.stem(word) for word in msg])
    return msg


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/JohanLg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
s =Cleantr['keywords'].apply(review_messages)

Here we have a clean list of words ready to be made into a dictionary. First we split our data and make the train-sets cleaner

In [11]:
Y = Cleantr['sex']

In [33]:
#split data in test and training dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(s,  Y, test_size = 0.1, random_state = 1)

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_count = CountVectorizer()
from sklearn.metrics import accuracy_score  

In [14]:
dic = vectorizer_count.fit(X_train)
    # Creates the dictionary of words and word counts

print(dic.vocabulary_)

X_train_count = vectorizer_count.fit_transform(X_train)
    # Creates a numeric array of these words for ml models





In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_tfidf = TfidfVectorizer()
dic_tdif = vectorizer_tfidf.fit(X_train)
X_train_tfidf=dic_tdif.fit_transform(X_train)
print(dic_tdif.vocabulary_)




In [17]:
print(pd.DataFrame(dic_tdif.idf_)[0].value_counts().head())
print(type(dic_tdif.vocabulary_))

11.271060    26625
10.865595     5425
10.577913     2561
10.354770     1734
10.172448     1182
Name: 0, dtype: int64
<class 'dict'>


## Predicting Gender

In [26]:
vectorizer_count.fit(X_train)
X_trainA =vectorizer_count.transform(X_train)
X_testA =vectorizer_count.transform(X_test)

In [27]:
def modeleval(stuff,stuff1,stuff2,stuff4):
    models = [stuff1,stuff2,stuff2,stuff4]
    for model1 in models:
        model = model1
        Algo = model.fit(X_trainA, y_trainA)
        predictions = Algo.predict(X_test)
    
        print( "-----%s-----" % model1)
        print("\n")
        print(confusion_matrix(y_test, predictions))
        print( "\n")
        print("Accuracy Score ; %8.2f" % (accuracy_score(predictions, y_test)*100))
        print( "\n")

In [28]:
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
modeleval(svm, RandomForestClassifier(), MultinomialNB(), xgb.XGBClassifier())

-----RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)-----


[[1378 1575]
 [1037 2429]]


Accuracy Score ;    59.31


-----MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)-----


[[ 706 2247]
 [ 387 3079]]


Accuracy Score ;    58.97


-----MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)-----


[[ 706 2247]
 [ 387 3079]]


Accuracy Score ;    58.97


-----XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bytree=1, gamma=0, learning_ra

In [29]:
BestModelGender = RandomForestClassifier()
mlb_tfidf = BestModelGender.fit(X_train, y_train)

## Predicting Age

In [30]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn import metrics


In [50]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(s, Cleantr['age'], test_size = 0.1, random_state = 1)

In [51]:
vectorizer_tfidf.fit(X_train1)
X_train1 =vectorizer_tfidf.transform(X_train1)
X_test1 =vectorizer_tfidf.transform(X_test1)

In [52]:
lin = linear_model.Ridge(alpha=.1)

In [53]:
X_train

<57768x48852 sparse matrix of type '<class 'numpy.float64'>'
	with 971721 stored elements in Compressed Sparse Row format>

In [54]:
def modeleval_age(stuff,stuff1,stuff2,stuff4):
    models = [stuff1,stuff2,stuff2,stuff4]
    for model1 in models:
        model = model1
        Algo = model.fit(X_train1, y_train1)
        pred = Algo.predict(X_test1)
    
        print( "-----%s-----" % model1)
        print("\n")
        print('Mean Absolute Error:', metrics.mean_absolute_error(y_test1, pred))  
        print('Mean Squared Error:', metrics.mean_squared_error(y_test1, pred))  
        print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test1, pred)))
        print("\n")

In [None]:
modeleval_age(lin , xgb.XGBRegressor(), RandomForestClassifier(), MultinomialNB())

-----XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=True, subsample=1)-----


Mean Absolute Error: 10.355412299136429
Mean Squared Error: 162.59158551531175
Root Mean Squared Error: 12.75114055742904




In [82]:
BestModelAge = xgb.XGBRegressor()
xg_reg = BestModelAge.fit(X_train1, y_train1)

## Predicting TEST set

In [83]:
DF1 = []
DF2 = []

for chunk in pd.read_csv('test.csv', chunksize = 10000):
    chunk = chunk['keywords']
    chunk = chunk.dropna()
    chunk = chunk.apply(split_keywords_expand)
    hot = pd.DataFrame(vectorizer_count.transform(chunk).toarray())
    Gender = mlb_tfidf.predict(hot) 
    Age = xg_reg.predict(hot)
    DF1.append(Gender)
    DF2.append(Age) 

In [84]:
Final_G = []
Final_A = []
for i in range(0,len(DF1)):
        G = pd.DataFrame(DF1[i])
        A = pd.DataFrame(DF2[i])
        Final_G.append(G)
        Final_A.append(A)
Final = pd.concat(Final_G) 
Final['Age'] = pd.concat(Final_A)


In [87]:
Final.reset_index()

Unnamed: 0,index,0,Age
0,0,F,49.162727
1,1,M,53.760742
2,2,M,45.062714
3,3,F,45.827866
4,4,M,45.062714
...,...,...,...
2748738,1565,M,48.663189
2748739,1566,F,45.062714
2748740,1567,F,44.846558
2748741,1568,F,38.589443


In [None]:
Final.to_csv('Final_Predictions_Group_8.csv')