<a id="1"></a> <br>
# 1. Introduction
In this project, you must identify the target of a pronoun within a text passage. The source text is taken from Wikipedia articles. You are provided with the pronoun and two candidate names to which the pronoun could refer. You must create an algorithm capable of deciding whether the pronoun refers to name A, name B, or neither.

The aim of this project is to help reduce gender bias in pronoun resolution.
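
To make the three-way decision (A, B, or neither) concrete, here is a minimal sketch of a distance-based baseline. It is not the method used in this notebook; the function name `nearest_candidate_baseline`, the `max_distance` threshold, and the example offsets are all assumptions made for illustration.

```python
def nearest_candidate_baseline(pronoun_offset, a_offset, b_offset, max_distance=None):
    """Guess the referent from character distances alone (hypothetical heuristic).

    Returns 'A', 'B', or 'NEITHER'. `max_distance` is an assumed cut-off:
    if both candidates are farther away than this, the guess is 'NEITHER'.
    """
    dist_a = abs(pronoun_offset - a_offset)
    dist_b = abs(pronoun_offset - b_offset)
    if max_distance is not None and min(dist_a, dist_b) > max_distance:
        return "NEITHER"
    # Ties go to A, an arbitrary choice for this sketch
    return "A" if dist_a <= dist_b else "B"


# Made-up offsets: B (67 characters away) is closer than A (83 characters away)
print(nearest_candidate_baseline(274, 191, 207))
print(nearest_candidate_baseline(274, 191, 207, max_distance=50))
```

A heuristic like this ignores grammar and meaning entirely, so it is useful only as a reference point for more serious models.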

<a id="11"></a> <br>
## 1.1 Preparing

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import gensim
import scipy
import json
import nltk
import sys
import csv
import os

In [None]:
sns.set(style='white', context='notebook', palette='deep')
warnings.filterwarnings('ignore')
sns.set_style('white')
%matplotlib inline

<a id="14"></a> <br>
## 1.2 Import dataset

In [None]:
print(os.listdir("../input/"))
gendered_pronoun_df = pd.read_csv('../input/test_stage_1.tsv', delimiter='\t')
submission = pd.read_csv('../input/sample_submission_stage_1.csv')

In [None]:
# Get a feel for the raw data
gendered_pronoun_df.shape

**The dataset contains 2000 rows and 9 columns.**

In [None]:
# Let's further investigate the rows and columns
gendered_pronoun_df.head()

<a id="152"></a> <br>
Here is an explanation of each column (attribute):

1. ID - Unique identifier for an example (matches Id in the output file format)
1. Text - Text containing the ambiguous pronoun and two candidate names (about a paragraph in length)
1. Pronoun - The target pronoun (text)
1. Pronoun-offset - The character offset of Pronoun in Text
1. A - The first name candidate (text)
1. A-offset - The character offset of name A in Text
1. B - The second name candidate
1. B-offset - The character offset of name B in Text
1. URL - The URL of the source Wikipedia page for the example
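
The offset columns can be read as slice positions into Text. The sketch below checks this on a single made-up example; the sentence, its offsets, and the helper `span_matches` are assumptions for illustration and are not taken from the dataset.

```python
def span_matches(text, mention, offset):
    """Return True if `mention` occurs in `text` exactly at character `offset`."""
    return text[offset:offset + len(mention)] == mention


# Hypothetical row using the same column names as the dataset
example = {
    "Text": "Alice thanked Beth because she had helped with the move.",
    "Pronoun": "she", "Pronoun-offset": 27,
    "A": "Alice", "A-offset": 0,
    "B": "Beth", "B-offset": 14,
}

for mention_col, offset_col in [("Pronoun", "Pronoun-offset"),
                                ("A", "A-offset"),
                                ("B", "B-offset")]:
    ok = span_matches(example["Text"], example[mention_col], example[offset_col])
    print(mention_col, example[mention_col], ok)
```

Running a check like this over every row is a cheap way to confirm that the offsets are zero-based character positions before building features on them.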

## 1.3 Check missing data

In [None]:
def check_missing_data(df):
    """Summarise missing values per column, or return False if there are none."""
    if not df.isna().any().any():
        return False
    total = df.isnull().sum()
    # Percentage of missing entries per column (missing count / number of rows * 100)
    percent = total / len(df) * 100
    output = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    output['Types'] = [str(df[col].dtype) for col in df.columns]
    return np.transpose(output)

In [None]:
check_missing_data(gendered_pronoun_df)

**The dataset has no missing values, so no imputation or row removal is needed.**
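
Because this dataset is complete, the summary branch of the check never runs here. As a rough illustration of what such a summary looks like when values are missing, the sketch below uses a small invented frame (`toy`); the values are made up and only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value per column
toy = pd.DataFrame({
    "Text": ["a b", None, "c", "d"],
    "A-offset": [0, 3, np.nan, 7],
})

summary = pd.DataFrame({
    "Total": toy.isnull().sum(),            # number of missing entries per column
    "Percent": 100 * toy.isnull().mean(),   # share of missing entries, in percent
    "Type": toy.dtypes.astype(str),         # column dtype as text
})
print(summary)
```

Using `isnull().mean()` gives the missing fraction directly, which avoids having to divide by the row count by hand.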

<a id="154"></a> <br>
## 1.4 Statistical Analysis


### 1.4.1 Number of words in the text

In [None]:
gendered_pronoun_df["num_words"] = gendered_pronoun_df["Text"].apply(lambda x: len(str(x).split()))

In [None]:
# Maximum and minimum number of words in Text
print('Maximum number of words in Text is:', gendered_pronoun_df["num_words"].max())
print('Minimum number of words in Text is:', gendered_pronoun_df["num_words"].min())

### 1.4.2 Number of unique words in the text

In [None]:
gendered_pronoun_df["num_unique_words"] = gendered_pronoun_df["Text"].apply(lambda x: len(set(str(x).split())))
print('Maximum number of unique words in Text is: ',gendered_pronoun_df["num_unique_words"].max())
print('Mean value of unique words in Text is: ',gendered_pronoun_df["num_unique_words"].mean())

### 1.4.3 Number of stopwords in the text

In [None]:
# Stop words are very common words (such as "the" or "and") that are usually filtered out before natural language text is processed.
# Here we use the Natural Language Toolkit (NLTK) stop word list to count them.
from nltk.corpus import stopwords
eng_stopwords = set(stopwords.words("english"))
gendered_pronoun_df["num_stopwords"] = gendered_pronoun_df["Text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

In [None]:
#Now let's calculate the maximum number of stop words
print('Maximum number of stopwords in Text is: ',gendered_pronoun_df["num_stopwords"].max())

### 1.4.4 Code the pronoun

In [None]:
# Let's see which pronouns occur in the dataset
pronoun=gendered_pronoun_df["Pronoun"]
np.unique(pronoun)

In [None]:
# Encode each pronoun as an integer so that it can be analysed numerically.
# Despite the name, this is a five-class code rather than a binary one;
# the column name Pronoun_binary is kept because later cells refer to it.
binary = {
    "He": 0,
    "he": 0,
    "She": 1,
    "she": 1,
    "His": 2,
    "his": 2,
    "Him": 3,
    "him": 3,
    "Her": 4,
    "her": 4
}
# Vectorised lookup; any pronoun missing from the mapping becomes NaN
gendered_pronoun_df['Pronoun_binary'] = gendered_pronoun_df['Pronoun'].map(binary)
gendered_pronoun_df.head(30)
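
Since the project is concerned with gender bias, it can also help to group pronouns by grammatical gender and compare how often each group occurs. The following is a minimal sketch on a hand-written list; the sets `MASCULINE` and `FEMININE`, the function `pronoun_gender`, and the sample values are assumptions and do not come from the dataset.

```python
from collections import Counter

# Assumed groupings for illustration
MASCULINE = {"he", "his", "him"}
FEMININE = {"she", "her", "hers"}


def pronoun_gender(pronoun):
    """Map a pronoun to 'masculine', 'feminine', or 'other', ignoring case."""
    p = pronoun.lower()
    if p in MASCULINE:
        return "masculine"
    if p in FEMININE:
        return "feminine"
    return "other"


# Hand-written sample, not real data
sample = ["He", "her", "his", "She", "him", "Her"]
counts = Counter(pronoun_gender(p) for p in sample)
print(counts)
```

Applied to the Pronoun column, a count like this would show whether masculine and feminine examples are balanced.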

## 1.5 Exploratory visualization

### 1.5.1 WordCloud

In [None]:
from wordcloud import WordCloud as wc
from nltk.corpus import stopwords

# English stop words are excluded from the cloud
eng_stopwords = set(stopwords.words("english"))

def generate_wordcloud(text):
    # Render a word cloud in which word size reflects frequency in `text`
    wordcloud = wc(relative_scaling = 1.0,stopwords = eng_stopwords, background_color = 'white').generate(text)
    fig,ax = plt.subplots(1,1,figsize=(10,10))
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis("off")
    ax.margins(x=0, y=0)
    plt.show()

# Join all passages into one string and plot the cloud
text = " ".join(gendered_pronoun_df.Text)
generate_wordcloud(text)

**Some of the most prominent words carry gender information:**
* daughter
* father
* son
* wife
* mother
* brother
* Mary
* William


In [None]:
gendered_pronoun_df.hist(color = 'orange',figsize=(15,15));

**The Pronoun_binary histogram shows that Him/him (code 3) occurs least often.**

In [None]:
pd.plotting.scatter_matrix(gendered_pronoun_df,color = 'orange',figsize=(20,20))
plt.show();

**The scatter matrix suggests roughly linear relationships among Pronoun-offset, A-offset, B-offset, num_words, num_unique_words, and num_stopwords, plausibly because all of them grow with the length of the text.**
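
A visual impression of linearity can be backed up with Pearson correlation coefficients. The sketch below runs on synthetic numbers generated for illustration only (the frame `toy` and the coefficients used to build it are assumptions), so the printed values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: the other columns are built to depend on num_words
rng = np.random.default_rng(0)
num_words = rng.integers(20, 120, size=200)
toy = pd.DataFrame({
    "num_words": num_words,
    "num_unique_words": (0.8 * num_words + rng.normal(0, 3, size=200)).round(),
    "Pronoun-offset": (5.5 * num_words * rng.uniform(0.3, 1.0, size=200)).round(),
})

# Pairwise Pearson correlation; values near 1 indicate a strong linear relationship
corr = toy.corr(method="pearson")
print(corr.round(2))
```

On the real data, the same call on the numeric columns of `gendered_pronoun_df` would quantify how strong each pairwise relationship is.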


In [None]:
# Now let's zoom in on the relationship between Pronoun-offset and A-offset
sns.jointplot(x='Pronoun-offset',y='A-offset',data=gendered_pronoun_df, kind='hex', color ='orange')

In [None]:
# Now let's visualize the relationship between Pronoun-offset and B-offset
sns.swarmplot(x='Pronoun-offset',y='B-offset',data=gendered_pronoun_df,palette = "Blues_d");

In [None]:
sns.violinplot(data=gendered_pronoun_df,x="Pronoun_binary", y="num_words", palette = "Blues_d")

<a id="2"></a> <br>
# 2. Natural Language Processing
Next, we use the Natural Language Toolkit (NLTK) to apply some basic NLP steps to the text.

## 2.1 Tokenize

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
# First, take the Text of the first row
our_text=gendered_pronoun_df.Text[0]
#Then, let's tokenize the text by word and store the result to variable 'words'
words = word_tokenize(our_text)
#Let's tokenize the text by sentence and store the result to variable 'phrases'
phrases = sent_tokenize(our_text)

In [None]:
#print them
print(words)

In [None]:
print(phrases)

<a id="23"></a> <br>
## 2.3 Stop Words
In this step, we are going to filter the stop words in the text

In [None]:
from nltk.corpus import stopwords

In [None]:
stopWords = set(stopwords.words('english'))
words = word_tokenize(our_text)
new_words = []

for w in words:
    if w not in stopWords:
        new_words.append(w)
 
print(new_words)

After filtering out the stop words, the token list is noticeably shorter.
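
One detail worth noting: the NLTK English stop word list is lowercase, so a case-sensitive membership test lets capitalised tokens such as "The" through. The sketch below shows a case-insensitive variant. To stay self-contained it uses a tiny hand-written stop word set (`STOP_WORDS`) instead of the NLTK corpus; the set, the function name, and the example tokens are assumptions.

```python
# Tiny hand-written stop word set, standing in for the NLTK list
STOP_WORDS = {"the", "a", "and", "she", "he", "was", "of", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep original casing otherwise."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["The", "singer", "and", "she", "toured", "in", "Europe"]
print(remove_stop_words(tokens))  # ['singer', 'toured', 'Europe']
```

Whether pronouns should be removed at all depends on the task: for pronoun resolution they carry the signal, so stop word filtering is better suited to exploratory statistics than to model input.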

<a id="24"></a> <br>
## 2.4 Stemming
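Stemming reduces inflected or derived forms of a word to a common root (stem), mainly by stripping suffixes.
For example, after stemming, eating, eats, and eatings all become **eat**. Irregular forms such as eaten and ate are not mapped to eat; handling them requires lemmatization.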
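
A quick way to see what the Porter stemmer does with regular and irregular forms is to run it on a few words directly. This sketch assumes NLTK is installed; `PorterStemmer` is rule-based and needs no corpus download.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Regular forms lose their suffix; irregular forms are not reduced to "eat"
forms = ["eating", "eats", "eaten", "ate"]
stems = {word: ps.stem(word) for word in forms}
print(stems)
```

Because stemming works on suffixes rather than meaning, it is fast but crude; a lemmatizer is the usual alternative when irregular forms matter.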

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
 
for word in word_tokenize(our_text):
    print(ps.stem(word))

<a id="25"></a> <br>
## 2.5 Part-of-speech tagging
Part-of-speech (POS) tagging labels each word with its grammatical category, such as verb or noun.

In [None]:
import nltk

# Tag every token of every sentence with its part of speech
sentences = nltk.sent_tokenize(our_text)
for sent in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))

After POS tagging, we can select words by their tag.

In [None]:
import nltk

sentences = nltk.sent_tokenize(our_text)
data = []
for sent in sentences:
    # Collect the (word, tag) pairs of every sentence in one list
    data.extend(nltk.pos_tag(nltk.word_tokenize(sent)))

# Select the pronouns: both 'PRP' (personal) and 'PRP$' (possessive) contain 'PRP'
for word, tag in data:
    if 'PRP' in tag:
        print((word, tag))
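
The substring test above matches both personal pronouns (`PRP`) and possessive pronouns (`PRP$`) in the Penn Treebank tag set. If the two need to be told apart, an exact comparison works. The sketch below runs on a hand-tagged sentence, so it needs no tagger model; the sentence and its tags are written by hand for illustration and are not tagger output.

```python
# Hand-tagged example using Penn Treebank tags
tagged = [
    ("She", "PRP"), ("gave", "VBD"), ("her", "PRP$"), ("book", "NN"),
    ("to", "TO"), ("him", "PRP"), (".", "."),
]

# Exact tag comparison separates the two pronoun classes
personal = [word for word, tag in tagged if tag == "PRP"]
possessive = [word for word, tag in tagged if tag == "PRP$"]

print("personal:", personal)
print("possessive:", possessive)
```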