This notebook contains a small analysis of the dataset we are working with.

In [1]:
import pandas as pd

#### Reading the dataset

In [2]:
df = pd.read_csv("data.tsv",sep="\t")

#### Details about the Dataset

In [3]:
schema_df = pd.read_csv('schema.csv', index_col='column name')

In [4]:
df['Selected'].value_counts();

#### Cleaning the Dataset for preprocessing

- The column named `Unnamed: 22` is removed
- Long column names are replaced with short ones
- `Yes`/`No` answers (including the `Selected` column) and `Male`/`Female` are mapped to `1` or `0`
- `Age` is rounded to the nearest whole number
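
A column such as `Unnamed: 22` typically shows up when every row of the file ends with a trailing separator, so pandas sees one extra, empty field and has to invent a name for it. The sketch below reproduces this on a tiny made-up TSV (the names and ages are invented for illustration, not taken from `data.tsv`) and shows one way to drop any column that is entirely empty.

```python
import io
import pandas as pd

# Made-up TSV in which every line ends with a trailing tab.
raw = "Name\tAge\t\nAsha\t24\t\nRavi\t26\t\n"
toy = pd.read_csv(io.StringIO(raw), sep="\t")

# The empty third header field is named automatically by pandas.
print(toy.columns.tolist())  # ['Name', 'Age', 'Unnamed: 2']

# Drop every column that contains only missing values.
cleaned = toy.dropna(axis="columns", how="all")
print(cleaned.columns.tolist())  # ['Name', 'Age']
```

Whether `Unnamed: 22` in the real file is completely empty is an assumption; it is worth checking with `df['Unnamed: 22'].notna().sum()` before deleting it.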

In [5]:
# removing the 'Unnamed: 22' column from the dataframe
del df['Unnamed: 22']
# changing the names of the columns
df.rename(columns={'What do you think is your life’s purpose and why do you think having a purpose is important? (100 words or less)':'Purpose','Describe 5 things that attracted you to Unilever.':'About company'}, inplace=True)
# mapping Yes/No (and Male/Female) to 1/0
df['Selected'] = df['Selected'].map({'Yes':1,'No':0})
df['Working Experience'] = df['Working Experience'].map({'Yes':1,'No':0})
df['Gender'] = df['Gender'].map({'Male':1,'Female':0})
df['LinkedIn Profile'] = df['LinkedIn Profile'].map({'Yes':1,'No':0})
# rounding the age to the nearest whole number
df['Age'] = df['Age'].apply(round)
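
Two details of the cleaning step are easy to miss, so here is a small sketch on made-up values (not the real survey data). `Series.map` with a dictionary turns every value that is not a key into `NaN`, and Python's built-in `round` rounds halves to the nearest even integer rather than always up.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    "Selected": ["Yes", "No", "yes", None],
    "Age": [23.4, 24.5, 25.5, 26.6],
})

# "yes" (lower case) and None are not keys of the mapping, so both become NaN.
toy["Selected"] = toy["Selected"].map({"Yes": 1, "No": 0})
unmapped = int(toy["Selected"].isna().sum())
print(unmapped)  # 2

# round() uses round-half-to-even: 24.5 -> 24, but 25.5 -> 26.
toy["Age"] = toy["Age"].apply(round)
print(toy["Age"].tolist())
```

Checking `df['Selected'].isna().sum()` after the mapping is a cheap way to catch spelling variants before they reach the model.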

#### Importing modules for NLP

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize, PunktSentenceTokenizer
from nltk.corpus import stopwords
from nltk import ne_chunk
from nltk import pos_tag
from re import sub

STOPWORDS = set(stopwords.words("english"))

#### Algorithms for cleaning up sentences

In [7]:
from src.utils.clean_sentences import cleanup_sentence, cleanup_bulletpoints, cleanup_brackets

#### Scoring Algorithm for Clubs

In [8]:
from src.utils.extract_keypoints_clubs import clubs_main

In [9]:
df["Clubs & Associations"] = df["Clubs & Associations"].fillna("None")

In [10]:
df["Clubs & Associations"] = df["Clubs & Associations"].apply(clubs_main)
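
The code of `clubs_main` lives in `src/utils` and is not shown in this notebook. Purely as an illustration of the `fillna("None")` plus `apply` pattern, here is a hypothetical keyword-based scorer. The function name `toy_club_score` and the weights are assumptions made up for this sketch and may have nothing in common with the real implementation.

```python
import re

# Hypothetical keyword weights, invented for this sketch.
CLUB_WEIGHTS = {"president": 1.5, "secretary": 1.0, "member": 0.5}

def toy_club_score(text: str) -> float:
    """Sum the weights of the known keywords that occur in the text."""
    if text == "None":  # placeholder written by fillna("None")
        return 0.0
    words = set(re.findall(r"[a-z]+", text.lower()))
    return float(sum(w for key, w in CLUB_WEIGHTS.items() if key in words))

answers = ["None", "Member of LEO", "President and member of AIESEC"]
scores = [toy_club_score(a) for a in answers]
print(scores)  # [0.0, 0.5, 2.0]
```

A function of this shape can be passed to `Series.apply` exactly like `clubs_main` above.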

#### Scoring algorithm for professional qualifications

In [11]:
from src.utils.extract_keypoints_qualification import prof_qualification_main

In [12]:
df['Professional Qualification'] = df['Professional Qualification'].fillna("None")

In [13]:
df["Professional Qualification"] = df["Professional Qualification"].apply(prof_qualification_main) 

#### Calculating the score for sports

In [14]:
from src.utils.extract_keypoints_sport import sports_main

In [15]:
df['Sports'] = df['Sports'].fillna("None")

In [16]:
df['Sports'] = df['Sports'].apply(sports_main)

In [17]:
filt = df["Selected"] == 1  # Selected was mapped to 1/0 above

In [18]:
df.loc[filt,["Name","Sports"]];

#### Calculating the score for leadership

In [19]:
from src.utils.extract_keypoints_leadership import leadership_main

In [20]:
df['Leadership'] = df['Leadership'].fillna("None")

In [21]:
df['Leadership'] = df['Leadership'].apply(leadership_main)

In [22]:
df.loc[filt,["Name","Sports","Leadership","Clubs & Associations","Professional Qualification"]];

#### Calculating the score of skills and abilities

In [23]:
from src.utils.extract_keypoints_skills import find_similarity

In [24]:
df['Skills & Abilities'] = df['Skills & Abilities'].fillna("None")

In [25]:
df['Skills & Abilities'] = df['Skills & Abilities'].apply(find_similarity)
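
The `find_similarity` helpers are not shown in this notebook. Assuming they compare each answer with a reference text through a vector-based cosine similarity (which is what spaCy's `Doc.similarity` estimates by default), the idea can be sketched with plain word counts. This bag-of-words version is a simplified stand-in, not the method used by the helpers, and `bow_cosine` is a name made up here.

```python
import math
from collections import Counter

def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words that both texts share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

same = bow_cosine("team work leadership", "Leadership team work")
partial = bow_cosine("team work", "team spirit")
disjoint = bow_cosine("team work", "public speaking")
print(same, partial, disjoint)
```

Identical word sets score close to 1, texts without a shared word score 0, and everything else falls in between. Word-vector models differ in that they also give credit to related but different words.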

In [26]:
df.loc[filt,["Name","Skills & Abilities"]];

#### Calculating the score for the "Purpose" column

In [27]:
from src.utils.extract_keypoints_purpose import find_similarity

In [28]:
df['Purpose'] = df['Purpose'].fillna("None")

In [29]:
df['Purpose'] = df['Purpose'].apply(find_similarity)

In [30]:
df.loc[filt,["Name","Purpose"]];

#### Calculating the score for the "About company" column

In [31]:
from src.utils.extract_keypoints_company import find_similarity

In [32]:
df['About company'] = df['About company'].fillna("None")

In [33]:
df['About company'] = df['About company'].apply(find_similarity)

#### Calculating the score for "Function" column

In [34]:
df['Function'].unique();

In [35]:
df['Area Of Study'].unique();

### Classification

Dropping the target (`Selected`) and the columns that are not used as features, such as `Name`, `Area Of Study` and `Degree Course`

In [36]:
inputs = df.drop(['Selected','Name','Graduation Year','Date of Birth','Area Of Study','Degree Course','University','Duration'],axis='columns')

Separating the target variable from the input variables

In [37]:
target = df['Selected']
target.head()

0    1
1    1
2    1
3    1
4    1
Name: Selected, dtype: int64

Encoding the categorical values in the dataset

In [38]:
from sklearn.preprocessing import LabelEncoder

In [39]:
label_function = LabelEncoder()
label_source = LabelEncoder()
label_class = LabelEncoder()

In [40]:
inputs['function_label'] = label_function.fit_transform(inputs['Function'])
inputs['source_label'] = label_source.fit_transform(inputs['Source of Information'])
inputs['class_label'] = label_class.fit_transform(inputs['Class'])
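
`LabelEncoder` replaces each category with an integer, which gives nominal columns such as `Function` an arbitrary order. A decision tree can usually cope with that, but one-hot encoding is a common alternative when the order should not matter. The sketch below uses `pd.get_dummies` on a toy column; it is an optional variation, not what this notebook does.

```python
import pandas as pd

# Toy column with a few of the Function categories.
toy = pd.DataFrame({"Function": ["Finance", "HR", "Marketing", "Finance"]})

# One indicator column per category instead of a single integer code.
one_hot = pd.get_dummies(toy, columns=["Function"], prefix="function")
print(one_hot.columns.tolist())
# ['function_Finance', 'function_HR', 'function_Marketing']
```

The price is a wider table: one column per category instead of one column per feature.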

Dropping the original categorical columns and storing the result in a new table

In [41]:
new_df = inputs.drop(['Function','Source of Information','Class'],axis='columns')
new_df

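| | Age | Gender | LinkedIn Profile | Professional Qualification | Clubs & Associations | Sports | Leadership | Skills & Abilities | Working Experience | Purpose | About company | function_label | source_label | class_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 1 | 1 | 0.0 | 0.5 | 1 | 0.0 | 0.855535 | 0 | 0.845985 | 0.597256 | 0 | 4 | 3 |
| 1 | 24 | 0 | 1 | 0.0 | 0.5 | 2 | 3.0 | 0.778193 | 1 | 0.865453 | 0.865408 | 2 | 2 | 6 |
| 2 | 24 | 0 | 1 | 0.0 | 0.0 | 4 | 3.0 | 0.816065 | 1 | 0.827561 | 0.845935 | 3 | 4 | 6 |
| 3 | 26 | 1 | 1 | 0.0 | 0.0 | 1 | 0.0 | 0.830143 | 1 | 0.805668 | 0.835214 | 4 | 0 | 2 |
| 4 | 26 | 0 | 1 | 1.5 | 3.0 | 1 | 2.0 | 0.823671 | 0 | 0.809380 | 0.812789 | 2 | 4 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 889 | 26 | 0 | 1 | 0.0 | 0.0 | 2 | 0.0 | 0.765063 | 0 | 0.858381 | 0.782030 | 0 | 1 | 6 |
| 890 | 21 | 0 | 1 | 1.0 | 0.0 | 1 | 0.0 | 0.741437 | 0 | 0.850830 | 0.794379 | 3 | 1 | 2 |
| 891 | 23 | 1 | 1 | 0.0 | 0.0 | 1 | 0.0 | 0.775088 | 0 | 0.836564 | 0.781565 | 3 | 4 | 6 |
| 892 | 20 | 1 | 1 | 0.0 | 0.0 | 2 | 0.0 | 0.874758 | 1 | 0.860528 | 0.848885 | 1 | 2 | 2 |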


Splitting the dataset into training and test data

In [42]:
from sklearn.model_selection import train_test_split

In [43]:
input_train, input_test, target_train, target_test = train_test_split(new_df, target, test_size = 0.2)
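
The split above has no `random_state`, so every run produces a different partition and a different score. A reproducible variant that also keeps the class ratio the same in both parts could look like the sketch below. The data are synthetic and the seed `42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 20% positive labels.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

# random_state fixes the shuffle; stratify keeps the 20/80 ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))        # 80 20
print(int(y_train.sum()), int(y_test.sum()))  # 16 4
```

In this notebook the equivalent call would pass `new_df` and `target`, with `stratify=target`.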

Training the model

In [44]:
from sklearn import tree

In [45]:
model = tree.DecisionTreeClassifier()

In [46]:
model.fit(input_train, target_train)

DecisionTreeClassifier()

In [47]:
model.score(input_test, target_test)

0.8994413407821229
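
A single accuracy value from one random split is hard to interpret on its own, especially if most applicants belong to one class. Two cheap checks are cross-validation and a majority-class baseline. The sketch below runs both on synthetic, imbalanced data from `make_classification`; the 90/10 class ratio is an assumption for illustration and says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 14 features, roughly 90% of rows in class 0.
X, y = make_classification(
    n_samples=500, n_features=14, weights=[0.9, 0.1], random_state=0
)

# Baseline that always predicts the most frequent class.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)

# The same kind of model as in the notebook, scored on 5 folds.
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("baseline accuracy:", baseline.mean())
print("tree accuracy:    ", tree_scores.mean())
```

If the tree does not clearly beat the baseline, a high accuracy mostly reflects the class imbalance rather than anything the model has learned.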

In [48]:
new_df.columns

Index(['Age', 'Gender', 'LinkedIn Profile', 'Professional Qualification',
       'Clubs & Associations', 'Sports', 'Leadership', 'Skills & Abilities',
       'Working Experience', 'Purpose', 'About company', 'function_label',
       'source_label', 'class_label'],
      dtype='object')

### Prediction

In [49]:
print("1\t",list(label_function.classes_))
print("2\t",list(label_source.classes_))
print("3\t",list(label_class.classes_))

1	 ['Customer Development', 'Finance', 'HR', 'Marketing', 'Supply Chain (Engineering Students are Preferred)']
2	 ['Facebook', 'Instagram', 'University', 'Website', 'Word of mouth']
3	 ['Distinction', 'Expected - First Class', 'First Class', 'General', 'Merit', 'Second Lower', 'Second Upper', 'Second upper']


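Each label is the position of the class in the lists above, starting from 0:

- `Function`: `Customer Development` is 0, `Finance` is 1, `HR` is 2, and so on
- `Source of Information`: `Facebook` is 0, `Instagram` is 1, and so on
- `Class`: `Distinction` is 0, `Expected - First Class` is 1, and so on

Note that `Second Upper` and `Second upper` differ only in case and are therefore encoded as two separate classes (6 and 7).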
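
The class list printed above contains both `Second Upper` and `Second upper`, which the encoder treats as different categories. One possible fix is to normalise whitespace and capitalisation before encoding. The sketch below shows this on a few toy values; the trailing space in `"Merit "` is invented to show the effect of `strip`.

```python
import pandas as pd

# Toy values; "Second Upper"/"Second upper" mirror the printed class list.
classes = pd.Series(["Second Upper", "Second upper", "First Class", "Merit "])

# Remove surrounding whitespace and use consistent title case.
normalised = classes.str.strip().str.title()

print(classes.nunique(), normalised.nunique())  # 4 3
print(normalised.tolist())
```

Applied to the notebook, this would mean cleaning `inputs['Class']` before calling `label_class.fit_transform`, which changes the label numbers listed above.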

#### Change the values of the variables below

In [55]:
Age = 21
Gender = 0  # 1 if male, 0 if female
LinkedIn = 0  # 1 if a profile is present, 0 if not
Professional_Qualification = "I have done CIMA, CIM, ACCA, CA"  # enter a paragraph here
Clubs = "I have being a member in LEO and AIESEC"  # enter the clubs you have been a member of
Sports = "Basketball, swimming , hockey"  # enter the sports played
Leadership = "Captain in the Basketball team and President of sports club"
Skills  = " Team work Communication Emotional Intelligence Leadership\
        Problem Solving Negotiation Creativity Public Speaking\
        PresentationEmpathy Listening IT Skills PowerBI"
work_exp =  1 # 1 if yes else 0
purpose = """Plan good Society Happiness Energy Focus Future Service
Navigate Impact Sustainability Legacy Unique Authentic Better Safe """
company = """Work Culture Learning Exposure Global Multinational
Purpose Pioneer Enviornment Development leadership Career path
Inclusion Agile Diversity respect team work  Salary growth """

In [56]:
function = 2  # HR (Finance would be 1, see the label list above)
source = 0  # Facebook
_class = 0  # Distinction
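
Hard-coded label numbers are easy to get out of step with the encoder. A safer pattern is to ask the fitted encoder for the code of a class name. The sketch below fits a separate `demo_encoder` on the `Function` classes printed earlier so that it runs on its own; in the notebook the already fitted `label_function` would be used instead.

```python
from sklearn.preprocessing import LabelEncoder

# Class names copied from the printed output of label_function.classes_.
function_names = [
    "Customer Development",
    "Finance",
    "HR",
    "Marketing",
    "Supply Chain (Engineering Students are Preferred)",
]
demo_encoder = LabelEncoder().fit(function_names)

# transform() takes a list of names and returns their integer codes.
finance_code = int(demo_encoder.transform(["Finance"])[0])
hr_code = int(demo_encoder.transform(["HR"])[0])
print(finance_code, hr_code)  # 1 2
```

In the notebook this becomes, for example, `function = label_function.transform(["Finance"])[0]`.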

#### Getting scores of sentences

In [57]:
Professional_Qualification = prof_qualification_main(Professional_Qualification)
Clubs = clubs_main(Clubs)
Sports = sports_main(Sports)
Leadership = leadership_main(Leadership)
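# each module defines its own find_similarity, so import them under distinct names
from src.utils.extract_keypoints_skills import find_similarity as skills_similarity
from src.utils.extract_keypoints_purpose import find_similarity as purpose_similarity
from src.utils.extract_keypoints_company import find_similarity as company_similarity
Skills = skills_similarity(Skills)
purpose = purpose_similarity(purpose)
company = company_similarity(company)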

If the output (`Selected`) is `1`, the applicant is **selected**; otherwise the applicant is **not selected**.

In [59]:
model.predict([[Age, Gender, LinkedIn, Professional_Qualification, Clubs, Sports, Leadership, Skills, work_exp ,purpose, company, function, source, _class]])

array([0])
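
Passing a bare list works only as long as the values are in exactly the same order as the training columns. Building a one-row DataFrame with named columns makes the order explicit, and `predict_proba` shows how confident the tree is. The sketch below uses a small made-up training set with three of the columns so that it runs on its own; `demo_model` and the values are assumptions for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Shortened, made-up training data with three of the feature columns.
columns = ["Age", "Gender", "Sports"]
train = pd.DataFrame(
    [[21, 0, 1], [24, 1, 2], [26, 1, 4], [23, 0, 0]], columns=columns
)
labels = [0, 1, 1, 0]
demo_model = DecisionTreeClassifier(random_state=0).fit(train, labels)

# One applicant as a named row; reindexing by `columns` fixes the column order.
applicant = pd.DataFrame([{"Sports": 3, "Age": 25, "Gender": 1}])[columns]

prediction = demo_model.predict(applicant)
probabilities = demo_model.predict_proba(applicant)
print(prediction, probabilities)
```

In the notebook the same idea would use `new_df.columns` for the column order and the fitted `model` for the prediction.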