
#### Environment: Python 3.7 and Anaconda 5.2.0-py36_3

### Introduction
In this assignment we deal with authorship analysis, going beyond the author identification and author verification tasks. The aim of the challenge is to develop a classifier that assigns a set of Twitter texts to their corresponding labels. We focus on the gender classification task: identifying the gender of a tweet's author as accurately as possible. We have chosen to do this assignment in Python.

### The steps we have taken for preprocessing and feature extraction include:

* **Case normalization**

* **Filtering tweets to remove any traces of emoticons, symbols, pictographs, transport and map symbols, and flags (iOS)**

* **Removing URLs**

* **Tweet tokenization (unigrams)**

* **Bi-grams**

* **Stop words removal**

* **Removing punctuation**

* **Removing numbers**

* **POS tagging**

* **TF-IDF (Term Frequency-Inverse Document Frequency)**
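
The emoji-filtering step in the list is not shown as a separate cell in this notebook. A minimal sketch of how such a filter is commonly written with Unicode ranges follows. The ranges and the name `strip_emoji` are assumptions for illustration, not the code used for the reported results.

```python
import re

# Unicode blocks matching the categories named in the list (assumed ranges).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "]+"
)

def strip_emoji(text):
    """Replace characters in the ranges above with a space, then collapse whitespace."""
    return " ".join(EMOJI_PATTERN.sub(" ", text).split())

print(strip_emoji("good morning \U0001F600 flying out \U0001F680 today"))
```

Replacing with a space before collapsing whitespace keeps neighbouring words from being glued together when an emoji sits between them.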

### The classification methods we used for the gender classification task are:

**(a) Support vector machine (SVM),**

**(b) Logistic regression,**

**(c) Random forest,**

**(d) Perceptron (linear classifier)**


The Python code below is commented for easier reading.



In [None]:
import shutil
import os
os.chdir("/content/sample_data")
shutil.unpack_archive("data.zip", "data1")

In [None]:
! pip install emot

Collecting emot
  Downloading https://files.pythonhosted.org/packages/49/07/20001ade19873de611b7b66a4d5e5aabbf190d65abea337d5deeaa2bc3de/emot-2.1-py3-none-any.whl
Installing collected packages: emot
Successfully installed emot-2.1


#### Importing all the required packages

In [None]:
import re                   #for regular expressions
import pandas as pd         #for data manipulation
import numpy as np                          #for scientific computing
from itertools import chain                 #for chain iteration over values in dictionary
import itertools                            #for itertools
import copy
#from stop_words import get_stop_words
import string
from collections import Counter
#from emoji import UNICODE_EMOJI
from emot.emo_unicode  import UNICODE_EMO , EMOTICONS

from matplotlib import pyplot as plt
import tensorflow as tf
import sys
from importlib import reload
import keras
import xml.dom.minidom
import xml.etree.ElementTree as ET

import os
from bs4 import BeautifulSoup
import spacy

import nltk                 #for NLTK packages such as punkt (segmentation)
from nltk.tokenize import sent_tokenize,regexp_tokenize      #for extracting tokens
from nltk.corpus import stopwords                            #for Removing stop words 
from nltk.stem import WordNetLemmatizer                      
from nltk.probability import *              #for FreqDist
from nltk.collocations import *             #for collocations
from nltk.tokenize import RegexpTokenizer   #divide strings into substrings
from nltk.tokenize import MWETokenizer      #for multi word tokenizer
from nltk.tokenize.casual import TweetTokenizer
from nltk.tokenize import wordpunct_tokenize


from sklearn.feature_extraction.text import TfidfVectorizer   #Convert a collection of raw documents to a matrix 
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB,BernoulliNB
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score 
from sklearn import model_selection, svm
from sklearn.linear_model import Perceptron

In [None]:
os.chdir("/content/sample_data/data1")    #Changing directory to extract the data

#### Extracting the data from xml files 

In [None]:
#Function to extract the tweets and the language from one author's XML file
def extract_file(file_name):
    with open(file_name,encoding="utf8") as infile:   #the file is closed once it has been read
        contents = infile.read()
    soup = BeautifulSoup(contents,'xml')
    #extracting language
    author_tag = soup.find('author')
    lang = author_tag.get('lang')
    #extracting documents text
    documents_list = [doc.get_text() for doc in soup.find_all('document')]
    return [documents_list,lang]
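
For reference, the same extraction can be sketched with the standard library's `xml.etree.ElementTree`. The sample XML below is an assumed layout, inferred only from the tags the function reads (an `author` element with a `lang` attribute and nested `document` elements); the real files may differ. `extract_from_string` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped after the tags extract_file looks for.
SAMPLE_XML = """<author lang="en">
  <documents>
    <document><![CDATA[First tweet http://t.co/x]]></document>
    <document><![CDATA[Second tweet]]></document>
  </documents>
</author>"""

def extract_from_string(xml_text):
    """Return ([tweet, ...], lang) for one author's XML."""
    root = ET.fromstring(xml_text)
    # The sample has <author> as its root; otherwise look for it further down.
    author = root if root.tag == "author" else root.find(".//author")
    tweets = [doc.text or "" for doc in author.iter("document")]
    return tweets, author.get("lang")

tweets, lang = extract_from_string(SAMPLE_XML)
print(lang, len(tweets))
```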

In [None]:
xml_names = []
language_list = []
allfiles_list = []
directory = "/content/sample_data/data1/data"
os.chdir(directory)     #extract_file opens the files relative to the working directory

for file in os.listdir(directory):
    if file.endswith(".xml"):   # Matching all xml files
        match = re.search(r'([\w]+)',file)      # author id = file name without the extension
        if match:
            xml_names.append(match.group(1))
            tweets, lang = extract_file(file)   # parsing each file only once
            allfiles_list.append(tweets)
            language_list.append(lang)

#### Creating the dataframe which includes train and test data

In [None]:
documents_df = pd.DataFrame(columns = ['id','language', 'Tweets_data'])
documents_df['id'] = xml_names
documents_df['language'] = language_list
documents_df['Tweets_data'] = allfiles_list
print(documents_df.shape)
documents_df.head()         

(3600, 3)


Unnamed: 0,id,language,Tweets_data
0,b45b74a9a24bb30a153ef5b334a8423c,en,[Your mind swirls with fantastic visions and y...
1,8ed8cb11745ebea846fd90368151ae04,en,[I wasn't convinced that NZ scientists shld ma...
2,3d099d395ee39fdd4e620aa1a420b8f7,en,[Yaaaaassssss!!! Kesha wins in court against a...
3,b57c7d2d4726305d6c5d4f781e456b5,en,[Once my workout is done I consider the day ov...
4,db024f43f1c62e3c8b9d05c1ae3ac9ce,en,[#Travel How free Ryanair flights are the futu...


#### Loading Training and Testing data

In [None]:
# loading train data
os.chdir("/content/sample_data/data1")
train_labels = pd.read_csv('train_labels.csv')
train_labels.head()

Unnamed: 0,id,gender
0,d7d392835f50664fc079f0f388e147a0,male
1,ee40b86368137b86f51806c9f105b34b,female
2,919bc742d9a22d65eab1f52b11656cab,male
3,15b97a08d65f22d97ca685686510b6ae,female
4,affa98421ef5c46ca7c8f246e0a134c1,female


In [None]:
#Loading Test data
test_labels_data = pd.read_csv('test.csv')
test_labels_data.head()

## Preprocessing

### Case Normalization

In [None]:
#Joining each author's tweets into one string, then converting it to lower case
documents_df['Tweets_data'] = documents_df['Tweets_data'].apply(" ".join).str.lower()

In [None]:
documents_df['Tweets_data']

0       your mind swirls with fantastic visions and yo...
1       i wasn't convinced that nz scientists shld mar...
2       yaaaaassssss!!! kesha wins in court against ac...
3       once my workout is done i consider the day ove...
4       #travel how free ryanair flights are the futur...
                              ...                        
3595    @janeseyd bike ride downtown was 25! comforted...
3596    @dan_pathfinder @irishwildlife @antaisce @peat...
3597    i can't believe i'm getting my braces off on s...
3598    cmon australia, keep #celebtom in there &amp; ...
3599    i'm so hungry but we just moved flat so i have...
Name: Tweets_data, Length: 3600, dtype: object

### Removing URLs

In [None]:
#Creating a new column filt to hold the filtered text
#Using deepcopy so that edits to filt do not affect the values in the Tweets_data column
documents_df['filt'] = documents_df['Tweets_data'].apply(copy.deepcopy)
documents_df.columns

Index(['id', 'language', 'Tweets_data', 'filt'], dtype='object')

In [None]:
#Removing URLs
count = 0
for i in range(len(documents_df['filt'])):
    for j in range(len(documents_df['filt'][i])):
        if re.search(r'http\S+',documents_df['filt'][i][j]):     # finding only the tweets containing URLs
            count = count+1                                      # counting the tweets that contain a URL
            documents_df['filt'][i][j] = re.sub(r'http\S+','',documents_df['filt'][i][j]) #substituting and assigning
print(count)

159297
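
The same step can be written as a small helper that works on one author's list of tweets and reports how many of them contained a URL. This is a sketch that assumes each row holds a list of tweet strings; `strip_urls` is a hypothetical name, and the regular expression is the one used in the cell above.

```python
import re

URL_RE = re.compile(r"http\S+")

def strip_urls(tweets):
    """Return (cleaned_tweets, n_tweets_with_url) for a list of tweet strings."""
    cleaned, hits = [], 0
    for tweet in tweets:
        new_tweet, n_subs = URL_RE.subn("", tweet)
        if n_subs > 0:          # count tweets, not individual URLs
            hits += 1
        cleaned.append(new_tweet)
    return cleaned, hits

cleaned, hits = strip_urls(["read this https://t.co/abc now", "no link here"])
print(cleaned, hits)
```

Returning a new list instead of editing in place keeps the original tweets available for later comparison.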


### Tweet Tokenization

### Generating Unigrams

In [None]:
#Creating a new column for unigrams; deepcopy keeps it independent of the filt column
documents_df['Unigram_Tweets_data'] = documents_df['filt'].apply(copy.deepcopy)
documents_df.columns

Index(['id', 'language', 'Tweets_data', 'filt', 'Unigram_Tweets_data'], dtype='object')

In [None]:
# strip_handles=True removes Twitter username handles from the text;
# reduce_len=True replaces repeated character sequences of length 3 or greater with sequences of length 3.
tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)   #built once and reused for every tweet
for i in range(len(documents_df['filt'])):         #Authors loop
    for j in range(len(documents_df['filt'][i])):  #Tweets loop
        documents_df['Unigram_Tweets_data'][i][j] = tweet_tokenizer.tokenize(documents_df['filt'][i][j])
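
To make the two `TweetTokenizer` options concrete, here is a rough standard-library approximation of what they do. It is not NLTK's implementation: the regular expressions and the whitespace split are simplifying assumptions, and `rough_tweet_tokens` is a hypothetical name.

```python
import re

HANDLE_RE = re.compile(r"(?<!\w)@\w+")   # assumed pattern for @handles
REPEAT_RE = re.compile(r"(.)\1{2,}")     # one character repeated 3 or more times

def rough_tweet_tokens(tweet):
    """Drop @handles, cap character runs at length 3, then split on whitespace."""
    no_handles = HANDLE_RE.sub(" ", tweet)
    capped = REPEAT_RE.sub(r"\1\1\1", no_handles)
    return capped.split()

print(rough_tweet_tokens("@janeseyd yaaaaassssss!!! kesha wins"))
```

Unlike the real tokenizer, this sketch leaves punctuation attached to words; it only illustrates the handle stripping and length reduction.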


In [None]:
print(id(documents_df['Unigram_Tweets_data'][0][1]))  #IDs are different as we have used deepcopy
print(id(documents_df['Tweets_data'][0][1]))

140058744817352
140058771663296


### Generating Bi-Grams

In [None]:
#Creating a new column for bigrams as a deep copy of the unigram column
documents_df['Bigram_Tweets_data'] = documents_df['Unigram_Tweets_data'].apply(copy.deepcopy)
documents_df.columns

Index(['id', 'language', 'Tweets_data', 'filt', 'Unigram_Tweets_data',
       'Bigram_Tweets_data'],
      dtype='object')

In [None]:
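#Bigrams: pairing adjacent tokens of each tweet (built from the unigram tokens, not from the raw string)
for i in range(len(documents_df)):
    for j in range(len(documents_df['Bigram_Tweets_data'][i])):
        documents_df['Bigram_Tweets_data'][i][j] = list(nltk.bigrams(documents_df['Unigram_Tweets_data'][i][j]))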
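
Bigrams pair each element with the one after it, so the input type matters: a list of tokens gives word bigrams, while a raw string gives character bigrams. The helper below is a plain-Python stand-in for `nltk.bigrams`, shown only to illustrate that difference.

```python
def bigrams(sequence):
    """Pair each item with its successor, as nltk.bigrams does for a sequence."""
    items = list(sequence)
    return list(zip(items, items[1:]))

tokens = ["free", "ryanair", "flights"]
word_pairs = bigrams(tokens)     # pairs of adjacent words
char_pairs = bigrams("free")     # a raw string yields pairs of characters instead
print(word_pairs)
print(char_pairs)
```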


### Stop Words removal

In [None]:
# Creating a new column to store the tokens after stop words are filtered (deep copy of the unigram column)
documents_df['No_SW'] = documents_df['Unigram_Tweets_data'].apply(copy.deepcopy)
documents_df.columns

Index(['id', 'language', 'Tweets_data', 'filt', 'Unigram_Tweets_data',
       'Bigram_Tweets_data', 'No_SW'],
      dtype='object')

In [None]:
#Function to remove stop words; stop_w is the stop word list set for the current author before the call
def remove_sw(txt):
    return [w for w in txt if w not in stop_w]

In [None]:
#Stopwords removal
from stop_words import get_stop_words    #from the stop-words package (pip install stop-words)
for i in range(len(documents_df['Unigram_Tweets_data'])):
    author_lang = documents_df['language'][i]
    stop_w = get_stop_words(author_lang)     #stop words for this author's language, read by remove_sw
    for j in range(len(documents_df['Unigram_Tweets_data'][i])):
        documents_df['No_SW'][i][j] = remove_sw(documents_df['Unigram_Tweets_data'][i][j])



### Removing Punctuations

In [None]:
# Function to remove punctuation tokens (tokens found in string.punctuation, e.g. "!" or ",")
def remove_punct(txt):
    return [r for r in txt if r not in string.punctuation]
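
One detail worth knowing: testing a token with `in string.punctuation` is a substring test, so a multi-character token such as `!!` or `...` is kept, while `()` happens to be dropped because those two characters sit next to each other in `string.punctuation`. The sketch below shows a per-character check, under the assumption that tokens made up only of punctuation should be removed. `is_all_punct` is a hypothetical helper, not part of the notebook.

```python
import string

PUNCT = set(string.punctuation)

def is_all_punct(token):
    """True when every character of a non-empty token is ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)

tokens = ["wow", "!!", "...", ":", "don't"]
kept = [t for t in tokens if not is_all_punct(t)]
print(kept)
```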

In [None]:
#Creating a new column to store the filtered data (deep copy of No_SW)
documents_df['No_punct'] = documents_df['No_SW'].apply(copy.deepcopy)
documents_df.columns

Index(['id', 'language', 'Tweets_data', 'filt', 'Unigram_Tweets_data',
       'Bigram_Tweets_data', 'No_SW', 'No_punct'],
      dtype='object')

In [None]:
#Removing Punctuations
for i in range(len(documents_df['No_punct'])):
    for j in range(len(documents_df['No_punct'][i])):
        documents_df['No_punct'][i][j] = remove_punct(documents_df['No_SW'][i][j]) 

### Removing numbers 

In [None]:
#Creating a new column to store the filtered data (deep copy of No_punct)
documents_df['fully_filtered'] = documents_df['No_punct'].apply(copy.deepcopy)
documents_df.columns

Index(['id', 'language', 'Tweets_data', 'filt', 'Unigram_Tweets_data',
       'Bigram_Tweets_data', 'No_SW', 'No_punct', 'fully_filtered'],
      dtype='object')

In [None]:
#Removing Numbers
for i in range(len(documents_df['fully_filtered'])):
    for j in range(len(documents_df['No_punct'][i])):
        #Keeping only the tokens that are not made up entirely of digits;
        #a new list is built so that the No_punct column is left unchanged
        documents_df['fully_filtered'][i][j] = [tok for tok in documents_df['No_punct'][i][j] if not tok.isdigit()]
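
The three token filters (stop words, punctuation, numbers) can also be expressed as a single pass over a tweet's tokens. The sketch below uses a tiny hand-written stop word set as a stand-in for the `stop_words` package, and `clean_tokens` is a hypothetical name.

```python
import string

STOP_WORDS = {"the", "is", "a", "in"}   # stand-in list, for illustration only

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are not stop words, punctuation marks, or digit strings."""
    return [
        tok for tok in tokens
        if tok not in stop_words
        and tok not in string.punctuation
        and not tok.isdigit()
    ]

filtered = clean_tokens(["the", "flight", "is", "in", "2", "hours", "!"])
print(filtered)
```

Keeping the steps as separate columns, as the notebook does, makes each stage inspectable; a combined pass trades that for less copying.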

### Part-of-speech Tagging

In [None]:
#Creating a new column to store the filtered data (deep copy of fully_filtered)
documents_df['final_pro'] = documents_df['fully_filtered'].apply(copy.deepcopy)

In [None]:
for i in range(len(documents_df['id'])):
    documents_df['final_pro'][i] = (list(itertools.chain.from_iterable(documents_df['fully_filtered'][i])))

In [None]:
from emoji import UNICODE_EMOJI
# Function to detect emoji

def is_emoji(s):         #Returns True if s is one of the emoji listed in UNICODE_EMOJI
    return s in UNICODE_EMOJI

In [None]:
documents_df['BOG'] = 0     #Creating a new column to store Bag of words data

### Data Accumulation:
#### Approach 1
We have considered the entire vocabulary, which includes all parts of speech.

In [None]:
nltk.download('wordnet')
nltk.download('punkt')                #needed by nltk.word_tokenize
# Each row's tokens are joined into one string and retokenized; every token is lemmatized and kept, whatever its part of speech
lemmatizer = WordNetLemmatizer()      #Initialising a lemmatizer object
bag_of_words = []
for i in range(len(documents_df['final_pro'])):
    y = ' '.join(documents_df['final_pro'][i])
    wordsList = nltk.word_tokenize(y)
    bag_of_words.append([lemmatizer.lemmatize(word) for word in wordsList])
#Assigning the whole column at once instead of cell by cell
documents_df['BOG'] = pd.Series(bag_of_words, index=documents_df.index)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!




### Data Accumulation: Approach 2
We have considered only nouns and emojis from the entire corpus as the input data. We chose nouns because they convey significant information such as names and places, and emojis because they may help capture a trend, if one exists, that is useful for classification.


NOTE: Run this section of code (commented out below) to test the models with this approach to data accumulation.

In [None]:
#nltk.download('averaged_perceptron_tagger')
#nltk.download('punkt')                                   

# Each row of bag of words are converted to one string to be retokenized in order to tag parts of speech to the bag of words
#for i in range(len(documents_df['final_pro'])):
#    feature_words = []  
#    x = documents_df['final_pro'][i]
#    y = ' '.join(word for word in x)
#    wordsList = nltk.word_tokenize(y) 
#    pos = nltk.pos_tag(wordsList)
#    for j in range(len(pos)):                                  
#        if pos[j][1] =="NN" or is_emoji(pos[j][0]):    # Filtering only Nouns and Emojis
#            feature_words.append(pos[j][0])                                      
#    documents_df['BOG'][i] = feature_words

#wordsList

#### Merging our dataframe with the given train and test data based on ID to form the train and test sets

In [None]:
# Merging our preprocessed dataframe with the given train_labels data based on ID
train_labels = pd.merge(train_labels, documents_df, how = 'left', on = 'id')  
print(train_labels.shape)


(3100, 12)


In [None]:
# Merging our preprocessed dataframe with the given test data based on ID
test_labels_data = pd.merge(test_labels_data, documents_df, how = 'left', on = 'id')
print(test_labels_data.shape)


(500, 12)


#### Feature Extraction using TfidfVectorizer from the sklearn package

We now extract important features as input for training the model

In [None]:
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build the vocabulary on the training documents, then reuse it for the test documents
tfx_train = vectorizer.fit_transform(' '.join(value) for value in train_labels['BOG'])
tfx_test = vectorizer.transform(' '.join(value) for value in test_labels_data['BOG'])
# the gender labels are the targets, not text features; they are encoded as 0/1 further down



# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)


[4.67003207 2.99447461 8.34633274 ... 8.34633274 8.34633274 8.34633274]
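
With its default settings, `TfidfVectorizer` uses a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalises each row. The sketch below reproduces that arithmetic in plain Python on a toy corpus; it illustrates the formula and is not scikit-learn's code. As a check against the output above, a term that appears in exactly one of the 3100 training documents gets ln(3101 / 2) + 1 ≈ 8.3463, which matches the largest idf values printed.

```python
import math
from collections import Counter

def smooth_idf(n_docs, doc_freq):
    """Inverse document frequency with add-one smoothing."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_rows(docs):
    """Return one {term: weight} dict per tokenized document, L2-normalised."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * smooth_idf(len(docs), doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0   # guard for empty docs
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Toy corpus, made up for illustration.
docs = [["love", "this", "game"], ["love", "shopping"], ["game", "night"]]
rows = tfidf_rows(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
print(round(smooth_idf(3100, 1), 4))
```

Rare terms receive a larger weight than common ones within the same document, which is why "this" outweighs "love" in the first toy row.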


In [None]:
test_new = pd.read_csv('test_labels.csv')   #Loading the true labels for the test data
test_new.shape

(500, 2)

In [None]:
gender_to_int = {'male': 1, 'female': 0}                               #male -> 1, female -> 0
train_labels['gender'] = train_labels['gender'].map(gender_to_int)     #Encoding gender in train_labels
test_new['gender'] = test_new['gender'].map(gender_to_int)             #Encoding gender in test_labels

### Model performance and evaluation using various classifiers
* Support Vector Machine Classifier
* Logistic Regression Classifier
* Random Forest Classifier
* Perceptron Model (Linear Classifier Model)


### Support Vector Machine Classifier

In [None]:
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear')      # degree and gamma are not used by the linear kernel
SVM.fit(tfx_train,list(train_labels['gender']))
# predict the labels on the test dataset
predictions_SVM = SVM.predict(tfx_test)
# Use accuracy_score function to get the accuracy (true labels first, predictions second)
print("SVM Accuracy Score -> ",accuracy_score(list(test_new['gender']),predictions_SVM)*100)

SVM Accuracy Score ->  80.2
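
Accuracy is the only metric reported here, although `precision_score`, `recall_score` and `f1_score` are imported. As a sketch of what those would measure, the plain-Python function below derives them from the confusion counts, treating label 1 (male) as the positive class. The toy labels are made up for illustration and are not results from this notebook; `binary_metrics` is a hypothetical name.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels for illustration only.
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)
```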


### Logistic Classifier

In [None]:
logisticRegr = LogisticRegression()          
logisticRegr.fit(tfx_train,list(train_labels['gender']))
predictions_log = logisticRegr.predict(tfx_test)
# Use accuracy_score function to get the accuracy
print("Logistic regression Accuracy Score -> ",accuracy_score(predictions_log,list(test_new['gender']))*100)

Logistic regression Accuracy Score ->  79.80000000000001


### Random Forest Classifier 

In [None]:
clf = RandomForestClassifier(n_estimators=500)          #Random Forest
clf.fit(tfx_train,list(train_labels['gender']))

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [None]:
predictions_randf = clf.predict(tfx_test)
# Use accuracy_score function to get the accuracy
print("Random forest classifier  Accuracy Score -> ",accuracy_score(predictions_randf,list(test_new['gender']))*100)

### Perceptron

In [None]:
per_cl = Perceptron(tol=1e-3, random_state=0)
per_cl.fit(tfx_train,list(train_labels['gender']))      #Perceptron
prct_pred = per_cl.predict(tfx_test)
# Use accuracy_score function to get the accuracy
print("Perceptron classifier Accuracy Score -> ",accuracy_score(prct_pred,list(test_new['gender']))*100)

Perceptron classifier Accuracy Score ->  77.60000000000001
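
For intuition, the perceptron is a linear classifier that changes its weights only when it misclassifies an example. The following plain-Python sketch shows that update rule on a toy, linearly separable problem. It is not scikit-learn's implementation, and the data, learning rate and epoch count are arbitrary assumptions.

```python
def train_perceptron(X, y, epochs=10, lr=1.0):
    """Learn weights w and bias b with the classic mistake-driven update."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            target = 1 if label == 1 else -1
            score = sum(wi * xi for wi, xi in zip(w, features)) + b
            if target * score <= 0:   # on or beyond the wrong side of the boundary
                w = [wi + lr * target * xi for wi, xi in zip(w, features)]
                b += lr * target
    return w, b

def predict(w, b, features):
    """Return 1 when the point falls on the positive side of the boundary, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, features)) + b > 0 else 0

# Toy data: class 1 has a large first feature, class 0 a large second feature.
X = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
y = [1, 1, 0, 0]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, row) for row in X]
print(w, b, predictions)
```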


In [None]:
pred = pd.DataFrame(columns = ['id','gender'])

In [None]:
pred['id'] = test_labels_data['id']
pred['gender'] = predictions_SVM

In [None]:
pred['gender'] = pred['gender'].map({1: 'male', 0: 'female'})    #Converting 1 back to 'male' and 0 back to 'female'



In [None]:
pred.head()

Unnamed: 0,id,gender
0,d6b08022cdf758ead05e1c266649c393,male
1,9a989cb04766d5a89a65e8912d448328,female
2,2a1053a059d58fbafd3e782a8f7972c0,male
3,6032537900368aca3d1546bd71ecabd1,male
4,d191280655be8108ec9928398ff5b563,male


In [None]:
pred.to_csv('pred_label.csv',index= False)