<a href="https://colab.research.google.com/github/Javier9898/Top100_Best_Clients_Data_Analysis/blob/master/Alpha_Coding_Challenge_JavierJimenez.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Alpha Coding Challenge - English**
by Javier Jiménez

## **Business Problem**
Alpha specializes in delivering technology solutions. One of their engagement models is IT Staff Augmentation, where they offer their engineers as part of an extension of the client´s teams. Approximately, 95% of their engineers are located in LATAM. Currently, this company is looking to expand their client portfolio by sending an email marketing campaign.

## **Problem Statement**
A file containing LinkedIn public data has been provided. Given this information, Alpha would like to know the top 100 people with the highest chance of becoming their client.

## **Data:**
The data used will be the one in the file provided: people.in

### **At the end of this notebook I'll mention:**

1.   Ways in which my algorithm could be improved.
2.   What additional data I would consider to be relevant to improve my algorithm.



---



---



## **Code**

In [4]:
#installing google translate library
!pip install googletrans



## Libraries

In [5]:
import numpy as np # Multi-dimensional arrays and matrices
import pandas as pd # Data manipulation

import re # Provides regular expression matching operations
import os # Provides functions for interacting with the operating system
from googletrans import Translator # Allows translation

## Loading The Data In

I'll load in a "people.in" file and transform it into a csv file so I'm able to open it with Python's Pandas library to manipulate the data and store it in a Dataframe.

In [6]:
# Loading the data in
file_path = 'people.in'
base = os.path.splitext(file_path)[0]
os.rename(file_path, base+'(in)' + '.csv')

people = pd.read_csv("people(in).csv", sep='|', header=None)

## Data Cleaning

The data came with no headers so I'll add them to the Dataframe.

In [7]:
# Adding column names to the dataframe
people.columns = ['PersonId', 'Name', 'LastName', 'CurrentRole', 'Country', 
                  'Industry', 'NumberOfRecommendations', 'NumberOfConnections']

Requesting a small sample to get an idea of what the dataframe looks like.

In [8]:
# Requesting a small sample of 3 rows
people.sample(3)

Unnamed: 0,PersonId,Name,LastName,CurrentRole,Country,Industry,NumberOfRecommendations,NumberOfConnections
701,641020902,kylie,chambers,media sales adviser,Australia,Information Services,0,0
807,638379440,jeny,cassady,professional puppeteer,Canada,Performing Arts,0,0
2143,642361227,bruce,clemmer,assistant director and oper&infra,Canada,Education,0,0


Next I'm going to check if there are any anomalies.

In [9]:
#checking the data info for any anomalies
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2779 entries, 0 to 2778
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   PersonId                 2779 non-null   int64 
 1   Name                     2778 non-null   object
 2   LastName                 2776 non-null   object
 3   CurrentRole              2155 non-null   object
 4   Country                  2777 non-null   object
 5   Industry                 2779 non-null   object
 6   NumberOfRecommendations  2779 non-null   int64 
 7   NumberOfConnections      2779 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 173.8+ KB


I´ll drop the Name and LastName columns since these factors don't significally affect the quality of a client.

In [10]:
# Dropping Name and LastName columns
people = people.drop(columns= ['Name', 'LastName'])

I'll be deleteing all the rows containing NaNs since they only come from **CurrentRole** and **Country**. Nan values in the CurrentRole column could mean these people are currently unemployed in their industry and probably don't require Alpha's services. The reason why I'll delete NaN values in the Country column is because of the miniscule amount of unanswered values that are in it. Deleting these NaN values will reduce the size of the dataset and might help me get results faster while filtering it to choose the best clients.

In [11]:
# Deleteing all the rows containing NaNs
people = people.dropna()

It is important to make sure that all the data is in the same language. With the help of the Google Translate library I´ll be able to translate the all data to english if some of it previously wasn't.

In [12]:
# making sure the data is in the same language

translator=Translator()

translations = {}
for column in people[["CurrentRole", "Industry"]]:
    # unique elements of the column
    unique_elements = people[column].unique()
    for element in unique_elements:
        # add translation to the dictionary
        translations[element] = translator.translate(element).text
    
#print(translations) # to make sure everything is adequately transalted

In [13]:
#  Applying the translations to the Dataframe
people.replace(translations, inplace = True)

Having a last look at the data info to make sure its clean and appropiate to work with.

In [14]:
#Checking for any anomalies
people.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2153 entries, 0 to 2778
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   PersonId                 2153 non-null   int64 
 1   CurrentRole              2153 non-null   object
 2   Country                  2153 non-null   object
 3   Industry                 2153 non-null   object
 4   NumberOfRecommendations  2153 non-null   int64 
 5   NumberOfConnections      2153 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 117.7+ KB


Having a look at the cleaned Dataframe I'll work with to gather the results.

In [15]:
# Requesting a small sample
people.sample(3)

Unnamed: 0,PersonId,CurrentRole,Country,Industry,NumberOfRecommendations,NumberOfConnections
1526,642868966,board member,United States,Education,0,0
703,637984874,acting sales manager,Australia,Information Services,0,0
1792,642233235,owner,United Kingdom,Hospitality,0,0


## Data Exploration

Checking the unique values in **CurrentRole**, **Country** and **Industry** will allow me to gain more insight and clarity to make the correct decisions while choosing the best clients for Alpha's client portfolio.

Checking unique roles of these people and their total amount.

In [16]:
# Requesting to see unique roles of these people and their total amount.
print(people.CurrentRole.unique())
print(" ")
print("Unique values: " + str(people.CurrentRole.nunique()))

['vice president' 'chief revenue officer'
 'vp, customer operations and support' ...
 'child care eligibility supervisor' 'freelancer' 'channel management']
 
Unique values: 1607


There are many roles but only a few have the power to request or recommend Alpha's services.

Checking the unique countries were these people are from and the total amount of these.

In [17]:
# Requesting to see the unique countries were these people are from
# and the total amount of these.
print(people.Country.unique())
print(" ")
print("Unique values: " + str(people.Country.nunique()))

['Dominica' 'United States' 'Canada' 'Spain' 'India'
 'United Arab Emirates' 'United Kingdom' 'Turkey' 'Germany' 'Bangladesh'
 'Costa Rica' 'Mexico' 'Australia' 'China' 'Israel' 'Italy' 'France'
 'Netherlands' 'Sweden' 'Japan' 'Switzerland' 'Argentina' 'Chile' 'Poland'
 'Belgium' 'Singapore' 'Korea' 'Malta' 'Portugal' 'Brazil' 'South Africa'
 'Ireland' 'Colombia' 'Hong Kong' 'Denmark' 'Cyprus' 'Saudi Arabia'
 'Taiwan' 'Slovak Republic' 'Finland' 'Norway' 'Czech Republic' 'Kuwait'
 'Qatar' 'New Zealand' 'Hungary' 'Malaysia' 'Romania' 'Greece']
 
Unique values: 49


The fact that close to 95% of Alpha's developers are from LATAM and the data I've been given is so diverse in locations, makes me believe that Alpha offers a remote working experience to their clients. 

Checking the unique industries were these people are from and the total amount of these.

In [18]:
# Requesting to see the unique industries were these people are from
# and the total amount of these.
print(people.Industry.unique())
print(" ")
print("Unique values: " + str(people.Industry.nunique()))

['Telecommunications' 'Publishing' 'Computer Software' 'Electronics'
 'Investment Banking' 'Internet' 'Business Services' 'Oil & Energy'
 'Information Technology and Services' 'Renewables & Environment'
 'Consumer Electronics' 'Management Consulting' 'Insurance' 'Banking'
 'Food & Beverages' 'Manufacturing' 'Museums and Institutions'
 'Automobiles' 'Nonprofit Organization Management' 'Education'
 'Media Production' 'Brokerage' 'Transportation/Trucking/Railroad'
 'Hospitality' 'Marketing and Advertising' 'Test & Measurement Equipment'
 'Office Products' 'Construction' 'Retail' 'Legal Services' 'Design'
 'Computer Hardware' 'Hospital & Health Care' 'Furniture' 'Consumer Goods'
 'Utilities' 'Security Products & Services' 'Biotechnology'
 'Pharmaceuticals' 'Building Materials' 'Chemicals' 'Consumer Services'
 'Boats & Submarines' 'Newspapers' 'Real Estate' 'Financial Services'
 'Information Services' 'Libraries' 'Computer & Network Security'
 'Toys & Games' 'Education Management' 'Tobacco'

While every industry could benefit from technology, not every industry requires or is neccesarily actively looking for IT services.

## Choosing The Best Clients

I will define a good client as one who is part of an industry relevant to technology and has a current role capable of hiring or recommending an IT staff augmentation to the company. I'll also take into consideration the country where they are on, since in some countries IT is more common and sought. Lastly I'll rank the remaining people by their number of connections and recommendations since these can impact the company's popularity upon a recommendation from the client. With this data I can create a new Dataframe that contains only the selected people for a better handling of information.

Filtering data by people who work in industries relevant to IT.

In [19]:
# Industires where IT staff augmentation may most likely be wanted

relevant_industries=['Telecommunications', 'Computer Software',
       'Investment Banking', 'Internet',
       'Information Technology and Services',
       'Management Consulting', 'Nonprofit Organization Management', 
       'Computer Hardware', 'Biotechnology',
       'Consumer Services', 'Information Services',
       'Computer & Network Security',
       'Program Development','Staffing and Recruiting', 'Industrial Automation', 
       'E-Learning','Logistics and Supply Chain', 
       'Government Administration', 'Public Safety', 
       'Computer Networking', 'Airlines/Aviation', 'Maritime', 'Research',
       'Civic & Social Organization', 'Online Media','Market Research',
       'Political Organization', 'Philanthropy', 'Military',
       'Higher Education','Computer Games', 'Wire & Cable', 'Banking']

In [20]:
# Creating a DataFrame that stores the rows of the desired industries
df1 = pd.DataFrame (columns = ["PersonId", "CurrentRole", "Country", "Industry", 
                               "NumberOfRecommendations", "NumberOfConnections"])

# Storing only the rows from the desired industries in the dataframe
for row in people.Industry:
    if row in (relevant_industries):
      df1 = df1.append(people[people['Industry']==row])

In [21]:
# dropping dupliacte results
df1 = df1.drop_duplicates(subset=['PersonId'])

In [22]:
# Checking how many people remain
df1.shape

(524, 6)

After filtering by Indsutries, 524 people remain...

Next, I'll filter the data by the people with roles who have the power to recommend or hire in their companies.

In [23]:
# Roles with the power to recommend or hire in their companies.

search_values = ['president', 'director', 'principal', 'manager', 'research', 
            'human resources', 'recruiter','founder', 'operations', 
            'coordinator', 'chief', 'executive']

In [24]:
# Leaving only the relevant role position rows in the dataframe
df1 = df1[df1.CurrentRole.str.contains('|'.join(search_values))]

In [25]:
df1.shape

(252, 6)

After filtering, only 251 people remain...

Next I'll filter the data by Countries that may be more accepting of technology and its benefits.

In [26]:
df1.Country.unique()

array(['Dominica', 'Spain', 'United States', 'United Kingdom', 'Sweden',
       'Canada', 'United Arab Emirates', 'Germany', 'France', 'Israel',
       'Ireland', 'South Africa', 'Hong Kong', 'Australia', 'Japan',
       'Colombia', 'Saudi Arabia', 'Singapore', 'Netherlands', 'Portugal',
       'Switzerland', 'New Zealand', 'China', 'Italy'], dtype=object)

In [27]:
relevant_countries =['Spain', 'United States', 'United Kingdom', 'Sweden',
       'Canada', 'United Arab Emirates', 'Germany', 'France', 'Israel',
       'Ireland', 'South Africa', 'Hong Kong', 'Australia', 'Japan',
       'Colombia', 'Saudi Arabia', 'Singapore', 'Netherlands', 'Portugal',
       'Switzerland', 'New Zealand', 'China', 'Italy']

In [28]:
# Storing only the rows from the desired countries in a Dataframe
df2 = pd.DataFrame (columns = ["PersonId", "CurrentRole", "Country", "Industry", 
                               "NumberOfRecommendations", "NumberOfConnections"])

for row in df1.Country:
    if row in (relevant_countries):
      df2 = df2.append(df1[df1['Country']==row])

In [29]:
# dropping dupliacte results
df2 = df2.drop_duplicates(subset=['PersonId'])

In [30]:
df2.shape

(251, 6)

After filtering, only 250 people remain...

Lastly, I'll sort people by their number of connections and recommendations since these can impact the company's popularity upon a recommendation from the client. These would be the best clients. I'm only taking the top 100 since its the amount required.

In [31]:
# Sorting people by their number of connections and recommendations

people_out = df2.sort_values(['NumberOfConnections', 'NumberOfRecommendations'], 
                   ascending=False).head(100)

## Taking a look at the Results

Requesting the top 5 clients

In [32]:
#Requesting the top 5
people_out.head(5)

Unnamed: 0,PersonId,CurrentRole,Country,Industry,NumberOfRecommendations,NumberOfConnections
14,85424165,president,Canada,Information Technology and Services,5,406
2,556570894,"vp, customer operations and support",United States,Computer Software,0,270
5,277449146,vice president of business administration,Spain,Telecommunications,0,0
271,639290956,key account manager,Spain,Telecommunications,0,0
10,344601083,vice president - studio media strategy and ope...,United States,Telecommunications,0,0


Making sure there is 100 rows in the Dataframe.

In [33]:
people_out.shape

(100, 6)

## Saving the Results

The PersonID will be the information saved in a file called "people.out" since its the only result required.

In [34]:
# Saving the PersonId since its the only information required
col_to_keep = ['PersonId']
people_out[col_to_keep].to_csv("people.csv", header=False, index = False)

# Changing the file format to a ".out" file
file_path = 'people.csv'
base = os.path.splitext(file_path)[0]
os.rename(file_path, base + '.out')

## **Ways in which my algorithm could be improved**

From the data I've been given, I would say im happy with the outcome, however, I could improve it by knowing exactly what industries are the most relevant for Alpha's services, which roles are responsable for requesting the type of services the company sells and what countries usually require and seek companies like Alpha. Having this information would've improved the quality of the results.

## **Additional data I would consider to be relevant to improve my algorithm**

There are many ways in which getting more data can improve the results. Having identical data from previous marketing campaigns done by the company, where clients were succesfully gathered, could be used in a machine learning model to find out who is more likely to contact the company. 
Having information about current or past clients would help compare, between these and the potential ones, to find connections between them. Additionally, having more columns of information would add value to the dataset. Having a column about years in the industry could lead to results where there are more clients who have more experience in their industries, may be easier to work with and are more willing and able to pay for more services. An Activity column, measured by posts per week the user makes, could communicate that the person is more likely to recommend Alpha's services, ranking him higher in the list of potential clients.