In [2]:
import pandas as pd 
import numpy as np

# Exploratory Data Analysis - Marketing Data

This next one comes from: 
https://www.kaggle.com/datasets/ashydv/leads-dataset

by Ashish

In [14]:
# Raw string so the backslash in the Windows path is not treated as an escape character
leads = pd.read_csv(r"Marketing_Analysis\Leads.csv")
leads.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,...,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,...,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,...,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,...,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


In [16]:
leads.duplicated().any()

False

Checked for duplicate rows in the data - there are no duplicate rows

In [59]:
len(leads.columns)

37

In [60]:
len(leads)

9240

In [46]:
lead_na = leads.isna().sum()

This counts the number of NaN values in each of the columns.

In [75]:
lead_na.head(20)

Prospect ID                                         0
Lead Number                                         0
Lead Origin                                         0
Lead Source                                        36
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                       137
Total Time Spent on Website                         0
Page Views Per Visit                              137
Last Activity                                     103
Country                                          2461
Specialization                                   1438
How did you hear about X Education               2207
What is your current occupation                  2690
What matters most to you in choosing a course    2709
Search                                              0
Magazine                                            0
Newspaper Article           

In [74]:
lead_na.tail(18)

X Education Forums                             0
Newspaper                                      0
Digital Advertisement                          0
Through Recommendations                        0
Receive More Updates About Our Courses         0
Tags                                        3353
Lead Quality                                4767
Update me on Supply Chain Content              0
Get updates on DM Content                      0
Lead Profile                                2709
City                                        1420
Asymmetrique Activity Index                 4218
Asymmetrique Profile Index                  4218
Asymmetrique Activity Score                 4218
Asymmetrique Profile Score                  4218
I agree to pay the amount through cheque       0
A free copy of Mastering The Interview         0
Last Notable Activity                          0
dtype: int64

Columns with no NaN values (populated in every row):
- Prospect ID (unique customer ID)
- Lead Number (Lead number assigned to each lead procured)
- Lead Origin (Origin Identifier with which customer was identified to be a lead/ API, Landing Page Submission ...)
- Do Not Email (Opt in or no)
- Do Not Call (Opt in or no)
- Converted (Target Variable - was the lead successfully converted or not?)
- Total Time Spent on Website (Total time spent on website)
- Search* 
- Magazine* 
- Newspaper Article* 
- X Education Forums* 
- Newspaper* 
- Digital Advertisement* 
- Through Recommendations (indicates whether customer came through recommendations)
- Receive More Updates About Our Courses (opt in or not)
- Update me on Supply Chain Content (opt in or not)
- Get updates on DM Content (opt in or not)
- I agree to pay the amount through cheque (Has agreed to pay by check or not)
- A free copy of Mastering The Interview (opt in or not)
- Last Notable Activity (Last notable activity performed by customer)

'*' Indicates whether the customer had seen the ad in one of these items


Assessing the columns with NaNs, some of them have a relatively low number of missing values: 
- Lead Source (source of the lead: google, organic search, chat)
- TotalVisits (Number of visits by the customer on the website)
- Page Views Per Visit (Average number of pages on the site viewed per visit)
- Last Activity (Last activity performed by customer - opening email button, chat conversation... )

The middle range, missing in roughly 15% to 30% of the rows: 
- City (obvious)
- Lead Profile (A lead level assigned to each customer based on their profile)
- Country (obvious)
- Specialization (Industry domain where customer worked before. 'Select Specialization' also means the customer didn't select)
- How did you hear about X Education (Source where heard about X Ed)
- What is your current occupation (Indicates whether student, employed, or unemployed)
- What matters most to you in choosing a course (Option selected indicating main motto for joining course)

The high range, missing in roughly 1/3 to 1/2 of the rows: 
- Tags (assigned to customers indicating the status of their lead)
- Lead Quality (Indicated quality of the lead based on data and intuition of the employee assigned to the lead)
- Asymmetrique Activity Index**
- Asymmetrique Profile Index**
- Asymmetrique Activity Score**
- Asymmetrique Profile Score**

**An index and score assigned to each customer based on their activity and their profile. 
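
To make the three groups above reproducible, here is a minimal sketch that turns NaN counts into shares and buckets them. The counts are copied from the `lead_na` output above (a subset only); the bucket edges (5% and 30%) are my own assumption, chosen so the result matches the grouping in these notes. On the real data, `leads.isna().mean()` would give the shares directly.

```python
import pandas as pd

N_ROWS = 9240  # len(leads) from the cell above

# NaN counts copied from the lead_na output (subset, for illustration).
na_counts = pd.Series({
    "Converted": 0,
    "Lead Source": 36,
    "TotalVisits": 137,
    "City": 1420,
    "Lead Profile": 2709,
    "Tags": 3353,
    "Lead Quality": 4767,
})

# Fraction of rows missing in each column.
share = na_counts / N_ROWS

# Bucket edges are an assumption, not something defined by the dataset.
buckets = pd.cut(
    share,
    bins=[-1, 0, 0.05, 0.30, 1],
    labels=["none", "low", "middle", "high"],
)

summary = pd.DataFrame({"missing_share": share.round(3), "bucket": buckets})
print(summary)
```

Moving an edge changes which group a borderline column such as Lead Profile (about 29% missing) falls into, so the edges are worth revisiting before any column is dropped on this basis.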


--- 
## Considerations
Target Variable: *Converted*

Dependencies: The Asymmetrique activity index and score may depend on the views, visits, tags, and activity of the customer. Likewise, the Asymmetrique profile index and score may depend on lead profile, occupation, and other personal information already captured in other columns.

If a model is constructed, only independent features should be included, since features derived from other columns add no new information and would only double-count the same signal.

#### At first glance...
The columns that seem important to ensure data fidelity in analysis: 
- Lead Origin
- Total Time Spent on Website
- Last Notable Activity
- Lead Source
- Total Visits
- Page Views per Visit
- Last Activity
- Lead Profile (though this is a bit arbitrary and may include a lot of bias!)
- Specialization
- What matters most to you in choosing a course
- Lead Quality

Areas to consider bias
- Lead Profile 
- Lead Quality

#### Changes to make
Is there overlap among the following columns?
- Search* 
- Magazine* 
- Newspaper Article* 
- X Education Forums* 
- Newspaper* 
- Digital Advertisement*

These look like they could have come from a dropdown menu, but some clients may have selected more than one. Should investigate the setup. 
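
One way to check for overlap is to count how many of these Yes/No columns are 'Yes' on each row. The sketch below uses a made-up toy frame with three of the column names; `channels_per_lead` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def channels_per_lead(df: pd.DataFrame, columns: list) -> pd.Series:
    """Count how many of the given Yes/No columns are 'Yes' on each row."""
    return df[columns].eq("Yes").sum(axis=1)

# Toy stand-in for `leads`; the values are invented for illustration.
toy = pd.DataFrame({
    "Search": ["Yes", "No", "No", "Yes"],
    "Newspaper": ["No", "No", "No", "Yes"],
    "Digital Advertisement": ["No", "No", "Yes", "Yes"],
})

counts = channels_per_lead(toy, list(toy.columns))

# How many leads saw the ad in 0, 1, 2, ... channels.
overlap = counts.value_counts().sort_index()
print(overlap)
```

If no row ever has a count above 1 in the real data, the columns behave like a single-choice field; any count above 1 means multiple selections were possible.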

In [77]:
leads.nunique().head(20)

Prospect ID                                      9240
Lead Number                                      9240
Lead Origin                                         5
Lead Source                                        21
Do Not Email                                        2
Do Not Call                                         2
Converted                                           2
TotalVisits                                        41
Total Time Spent on Website                      1731
Page Views Per Visit                              114
Last Activity                                      17
Country                                            38
Specialization                                     19
How did you hear about X Education                 10
What is your current occupation                     6
What matters most to you in choosing a course       3
Search                                              2
Magazine                                            1
Newspaper Article           

In [78]:
leads.nunique().tail(18)

X Education Forums                           2
Newspaper                                    2
Digital Advertisement                        2
Through Recommendations                      2
Receive More Updates About Our Courses       1
Tags                                        26
Lead Quality                                 5
Update me on Supply Chain Content            1
Get updates on DM Content                    1
Lead Profile                                 6
City                                         7
Asymmetrique Activity Index                  3
Asymmetrique Profile Index                   3
Asymmetrique Activity Score                 12
Asymmetrique Profile Score                  10
I agree to pay the amount through cheque     1
A free copy of Mastering The Interview       2
Last Notable Activity                       16
dtype: int64

First, I want to investigate the columns with 1 value.

In [136]:
print(leads['Magazine'].unique(),
      leads['Receive More Updates About Our Courses'].unique(), 
      leads['Update me on Supply Chain Content'].unique(), 
      leads['Get updates on DM Content'].unique(),
      leads['I agree to pay the amount through cheque'].unique())


['No'] ['No'] ['No'] ['No'] ['No']


All of these columns yielded a single answer of No, so they add no value to the model and should be eliminated.
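
Rather than listing the single-value columns by hand, they can be found from `nunique()`. This is a minimal sketch on a toy frame; `drop_constant_columns` is a hypothetical helper name.

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that hold a single distinct value."""
    # dropna=False counts NaN as its own value, so a column that is 'No'
    # plus some NaN is kept for a closer look instead of being dropped.
    n_distinct = df.nunique(dropna=False)
    constant_cols = n_distinct[n_distinct <= 1].index
    return df.drop(columns=constant_cols)

# Toy stand-in for `leads`; the values are invented for illustration.
toy = pd.DataFrame({
    "Magazine": ["No", "No", "No"],
    "Search": ["No", "Yes", "No"],
    "Converted": [0, 1, 0],
})

trimmed = drop_constant_columns(toy)
print(list(trimmed.columns))
```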

Next, I want to investigate the columns with 2 values, just to make sure that they are indeed yes/no answers, or to see if it's a single answer and an NaN.

In [86]:
print(leads['Do Not Email'].unique(),
      leads['Do Not Call'].unique(),
      leads['Converted'].unique(),
      leads['Search'].unique(),
      leads['Newspaper Article'].unique(),
      leads['X Education Forums'].unique(),
      leads['Newspaper'].unique(),
      leads['Digital Advertisement'].unique(),
      leads['Through Recommendations'].unique(),
      leads['A free copy of Mastering The Interview'].unique())


['No' 'Yes'] ['No' 'Yes'] [0 1] ['No' 'Yes'] ['No' 'Yes'] ['No' 'Yes'] ['No' 'Yes'] ['No' 'Yes'] ['No' 'Yes'] ['No' 'Yes']


Indeed all of the inputs are No/Yes. The one exception in these answers is the Converted output, which is in binary, 0 and 1. 

The others have substantially more values. For those over 30, the values seem to be identifiers or correlate with the number of NaN values present. 

In [118]:
leads['Lead Origin'].value_counts()


Landing Page Submission    4886
API                        3580
Lead Add Form               718
Lead Import                  55
Quick Add Form                1
Name: Lead Origin, dtype: int64

In [119]:
leads['Lead Source'].value_counts()


Google               2868
Direct Traffic       2543
Olark Chat           1755
Organic Search       1154
Reference             534
Welingak Website      142
Referral Sites        125
Facebook               55
bing                    6
google                  5
Click2call              4
Press_Release           2
Social Media            2
Live Chat               2
youtubechannel          1
testone                 1
Pay per Click Ads       1
welearnblog_Home        1
WeLearn                 1
blog                    1
NC_EDM                  1
Name: Lead Source, dtype: int64

The only issue I see here is the capitalized and lower case Google, which will have to be adjusted so they are the same source.
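
Labels that differ only by capitalization can be spotted programmatically before deciding how to merge them. The sketch below runs on a made-up toy Series; `case_variants` is a hypothetical helper, and the explicit `replace` mapping is just one way to apply the fix.

```python
import pandas as pd

def case_variants(series: pd.Series) -> dict:
    """Map each lowercased label to its spellings when more than one exists."""
    variants = {}
    for label in series.dropna().unique():
        variants.setdefault(label.lower(), []).append(label)
    return {key: sorted(names) for key, names in variants.items() if len(names) > 1}

# Toy stand-in for leads['Lead Source']; the values are invented.
toy_source = pd.Series(["Google", "google", "bing", "Olark Chat", "Google", None])

print(case_variants(toy_source))

# Merge the lowercase spelling into the capitalized one.
cleaned = toy_source.replace({"google": "Google"})
print(cleaned.value_counts())
```

An explicit mapping is used instead of lowercasing everything so that other labels (for example bing) keep the spelling they have in the source data.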

In [121]:
leads['Specialization'].value_counts()


Select                               1942
Finance Management                    976
Human Resource Management             848
Marketing Management                  838
Operations Management                 503
Business Administration               403
IT Projects Management                366
Supply Chain Management               349
Banking, Investment And Insurance     338
Travel and Tourism                    203
Media and Advertising                 203
International Business                178
Healthcare Management                 159
Hospitality Management                114
E-COMMERCE                            112
Retail Management                     100
Rural and Agribusiness                 73
E-Business                             57
Services Excellence                    40
Name: Specialization, dtype: int64

Select is the same as NaN, so it should be set as such. Combined with the 1438 existing NaN values, this makes about 37% of the entries (3380 of 9240) NaN. This also appears to be a dropdown menu, so E-COMMERCE is a different option from E-Business.

In [122]:
leads['How did you hear about X Education'].value_counts()


Select                   5043
Online Search             808
Word Of Mouth             348
Student of SomeSchool     310
Other                     186
Multiple Sources          152
Advertisements             70
Social Media               67
Email                      26
SMS                        23
Name: How did you hear about X Education, dtype: int64

Again, Select is NaN, which removes even more of this data: together with the 2207 existing NaN values, 7250 of 9240 entries (about 78%) are missing. Should Other be considered the same as NaN? 
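
A minimal sketch of treating Select as missing and then recomputing the missing share per column. The toy frame is made up; `select_to_nan` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def select_to_nan(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df where the placeholder 'Select' becomes NaN."""
    out = df.copy()
    out[columns] = out[columns].replace("Select", np.nan)
    return out

# Toy stand-in for `leads`; the values are invented for illustration.
toy = pd.DataFrame({
    "Specialization": ["Select", "Finance Management", np.nan, "Select"],
    "City": ["Mumbai", "Select", "Mumbai", np.nan],
})

cleaned = select_to_nan(toy, ["Specialization", "City"])

# Fraction of rows missing once 'Select' counts as NaN.
missing_share = cleaned.isna().mean()
print(missing_share)
```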

In [123]:
leads['What is your current occupation'].value_counts()

Unemployed              5600
Working Professional     706
Student                  210
Other                     16
Housewife                 10
Businessman                8
Name: What is your current occupation, dtype: int64

This is only supposed to have 3 options: student, employed, unemployed. 
In this case, Other should be considered NaN. 
Working Professional and Businessman should be considered employed.
Housewife should be considered unemployed.

How did this nonsense happen?!?
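
The regrouping described above can be sketched as a dictionary lookup. The mapping follows my notes, not anything defined by the dataset, and the toy Series is made up.

```python
import numpy as np
import pandas as pd

# Mapping taken from the notes above; the target labels are my own choice.
OCCUPATION_MAP = {
    "Unemployed": "Unemployed",
    "Housewife": "Unemployed",
    "Working Professional": "Employed",
    "Businessman": "Employed",
    "Student": "Student",
    "Other": np.nan,
}

def collapse_occupation(series: pd.Series) -> pd.Series:
    """Collapse the six observed labels into Student / Employed / Unemployed."""
    # Series.map returns NaN for any value missing from the dict,
    # so existing NaN entries stay NaN.
    return series.map(OCCUPATION_MAP)

# Toy stand-in for leads['What is your current occupation'].
toy_occ = pd.Series(
    ["Unemployed", "Housewife", "Businessman", "Other", np.nan, "Student"]
)

collapsed = collapse_occupation(toy_occ)
print(collapsed.value_counts(dropna=False))
```

One caveat with `map`: a label that is not in the dictionary silently becomes NaN, so it is worth comparing the NaN count before and after.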

In [124]:
leads['What matters most to you in choosing a course'].value_counts()

Better Career Prospects      6528
Flexibility & Convenience       2
Other                           1
Name: What matters most to you in choosing a course, dtype: int64

Looking at this, it basically appears that everyone who answered is looking for Better Career Prospects or didn't answer at all. This won't add any value to the model, so this should be removed. 

In [126]:
leads['Lead Quality'].value_counts()


Might be             1560
Not Sure             1092
High in Relevance     637
Worst                 601
Low in Relevance      583
Name: Lead Quality, dtype: int64

This seems like the most biased, and may be strongly influenced by the experience of the person who set the lead quality.

That also makes this a great list to investigate the predictability of the lead quality - did those rated poorly relate to low conversions and vice versa?
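
A quick way to look at that question is the conversion rate per Lead Quality label, keeping the unrated leads as their own group. This is a sketch on made-up values; `conversion_by` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def conversion_by(df: pd.DataFrame, column: str, target: str = "Converted") -> pd.DataFrame:
    """Lead count and mean of the 0/1 target for each value of `column`."""
    # dropna=False keeps leads with no rating as a separate group.
    grouped = df.groupby(column, dropna=False)[target]
    summary = grouped.agg(leads="size", conversion_rate="mean")
    return summary.sort_values("conversion_rate", ascending=False)

# Toy stand-in for `leads`; the values are invented for illustration.
toy = pd.DataFrame({
    "Lead Quality": ["Worst", "Worst", "High in Relevance",
                     "High in Relevance", np.nan, np.nan],
    "Converted": [0, 0, 1, 1, 1, 0],
})

summary = conversion_by(toy, "Lead Quality")
print(summary)
```

Because Converted is coded 0/1, its mean within a group is the share of converted leads in that group.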

In [127]:
leads['Lead Profile'].value_counts()

Select                         4146
Potential Lead                 1613
Other Leads                     487
Student of SomeSchool           241
Lateral Student                  24
Dual Specialization Student      20
Name: Lead Profile, dtype: int64

In [128]:
leads['City'].value_counts()

Mumbai                         3222
Select                         2249
Thane & Outskirts               752
Other Cities                    686
Other Cities of Maharashtra     457
Other Metro Cities              380
Tier II Cities                   74
Name: City, dtype: int64

In [129]:
leads['Country'].value_counts()

India                   6492
United States             69
United Arab Emirates      53
Singapore                 24
Saudi Arabia              21
United Kingdom            15
Australia                 13
Qatar                     10
Hong Kong                  7
Bahrain                    7
Oman                       6
France                     6
unknown                    5
South Africa               4
Nigeria                    4
Germany                    4
Kuwait                     4
Canada                     4
Sweden                     3
China                      2
Asia/Pacific Region        2
Uganda                     2
Bangladesh                 2
Italy                      2
Belgium                    2
Netherlands                2
Ghana                      2
Philippines                2
Russia                     1
Switzerland                1
Vietnam                    1
Denmark                    1
Tanzania                   1
Liberia                    1
Malaysia      

Considering the 38 countries, City seems to be a useless predictor, as these are only cities in India, which is just one of the 38 countries. Furthermore, Other Cities, Other Metro Cities, and Tier II Cities may overlap depending on the definitions used. This seems like a poorly defined parameter for an international program.

In [130]:
leads['Asymmetrique Activity Index'].value_counts()

02.Medium    3839
01.High       821
03.Low        362
Name: Asymmetrique Activity Index, dtype: int64

In [131]:
leads['Asymmetrique Profile Index'].value_counts()

02.Medium    2788
01.High      2203
03.Low         31
Name: Asymmetrique Profile Index, dtype: int64

In [132]:
leads['Asymmetrique Activity Score'].value_counts()

14.0    1771
15.0    1293
13.0     775
16.0     467
17.0     349
12.0     196
11.0      95
10.0      57
9.0        9
18.0       5
8.0        4
7.0        1
Name: Asymmetrique Activity Score, dtype: int64

In [133]:
leads['Asymmetrique Profile Score'].value_counts()

15.0    1759
18.0    1071
16.0     599
17.0     579
20.0     308
19.0     245
14.0     226
13.0     204
12.0      22
11.0       9
Name: Asymmetrique Profile Score, dtype: int64

In [138]:
leads['Tags'].value_counts()


Will revert after reading the email                  2072
Ringing                                              1203
Interested in other courses                           513
Already a student                                     465
Closed by Horizzon                                    358
switched off                                          240
Busy                                                  186
Lost to EINS                                          175
Not doing further education                           145
Interested  in full time MBA                          117
Graduation in progress                                111
invalid number                                         83
Diploma holder (Not Eligible)                          63
wrong number given                                     47
opp hangup                                             33
number not provided                                    27
in touch with EINS                                     12
Lost to Others

This looks like information collected by admissions via phone, which means it is already beyond the website phase. The NaN values here cannot be excluded because it means that there hasn't been a conversation yet. In fact, I should add a binary column for tags assigned, yes/no. If a tag has been assigned, the customers are likely much closer to conversion. 

**I'd like to see if there are any without tags that get converted.**
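
The tag-assigned flag and the question above can be sketched with a cross-tabulation. The toy frame is made up, and the column name `Tag Assigned` is my own choice.

```python
import numpy as np
import pandas as pd

def tag_assigned_table(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate 'has a tag' against the Converted target."""
    # A tag counts as assigned whenever the Tags cell is not NaN.
    flagged = df.assign(**{"Tag Assigned": df["Tags"].notna()})
    return pd.crosstab(flagged["Tag Assigned"], flagged["Converted"])

# Toy stand-in for `leads`; the values are invented for illustration.
toy = pd.DataFrame({
    "Tags": [np.nan, "Ringing", np.nan, "Busy", np.nan],
    "Converted": [1, 0, 0, 1, 0],
})

table = tag_assigned_table(toy)
print(table)

# Untagged leads that still converted.
untagged_converted = table.loc[False, 1]
```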

In [135]:
leads['Last Activity'].value_counts()

Email Opened                    3437
SMS Sent                        2745
Olark Chat Conversation          973
Page Visited on Website          640
Converted to Lead                428
Email Bounced                    326
Email Link Clicked               267
Form Submitted on Website        116
Unreachable                       93
Unsubscribed                      61
Had a Phone Conversation          30
Approached upfront                 9
View in browser link Clicked       6
Email Received                     2
Email Marked Spam                  2
Visited Booth in Tradeshow         1
Resubscribed to emails             1
Name: Last Activity, dtype: int64

In [134]:
leads['Last Notable Activity'].value_counts()

Modified                        3407
Email Opened                    2827
SMS Sent                        2172
Page Visited on Website          318
Olark Chat Conversation          183
Email Link Clicked               173
Email Bounced                     60
Unsubscribed                      47
Unreachable                       32
Had a Phone Conversation          14
Email Marked Spam                  2
Approached upfront                 1
Resubscribed to emails             1
View in browser link Clicked       1
Form Submitted on Website          1
Email Received                     1
Name: Last Notable Activity, dtype: int64

This likely overlaps with Last Activity. That is definitely the case for the low-count values, as some of them are exactly the same in both columns. Are these only collected from those who convert? Or maybe from those who make it to the Tags list?

**Which of these had the biggest difference between last activity and last notable activity?**
**Did the ones with high differences lead to fewer conversions?**

Take the Olark Chat as an example. There were 973 leads with Olark Chat Conversation as the last activity, but only 183 with it as the last notable activity. How many of these last notable conversations led to conversion, compared with how many conversations were had in total? 

Is there an impact on what is driving customers forward vs. driving them away? 
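
As a first pass at these questions, the leads can be split by whether the two activity columns agree, and the conversion rate compared between the two groups. This is only a sketch on made-up values; `activity_agreement` is a hypothetical helper.

```python
import pandas as pd

def activity_agreement(df: pd.DataFrame) -> pd.DataFrame:
    """Lead count and conversion rate, split by whether both activity columns match."""
    # Rows where Last Activity is NaN compare as not equal.
    same = df["Last Activity"] == df["Last Notable Activity"]
    return (
        df.assign(same_activity=same)
          .groupby("same_activity")["Converted"]
          .agg(leads="size", conversion_rate="mean")
    )

# Toy stand-in for `leads`; the values are invented for illustration.
toy = pd.DataFrame({
    "Last Activity": ["Olark Chat Conversation", "Olark Chat Conversation",
                      "Email Opened", "SMS Sent"],
    "Last Notable Activity": ["Modified", "Olark Chat Conversation",
                              "Email Opened", "Modified"],
    "Converted": [0, 1, 1, 0],
})

agreement = activity_agreement(toy)
print(agreement)
```

A `pd.crosstab` of the two columns would show the full picture of which last activities end up recorded as Modified.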

---

# Next Steps - Data Cleaning
1. Remove the pointless columns
    - Magazine
    - Receive More Updates About Our Courses
    - Update me on Supply Chain Content
    - Get updates on DM Content
    - I agree to pay the amount through cheque
    - What matters most to you in choosing a course
    - City <br><br>


2. Fix the Lead Source data
    - Google and google are both used. There are only 5 values under the lowercase spelling, but this may have some influence
    - welearnblog_Home and WeLearn are also the same source, even though these account for only 2 values <br><br>


3. Fix Specialization, How did you hear about X Education, and Lead Profile
    - Select is NaN <br><br>


4. Fix What is your current occupation
    - There should only be 3: student, employed, unemployed
    - Working Professional and Businessman should be employed
    - Housewife should be unemployed
    - Other in this case is NaN, since it cannot be classified at all
<br><br>

5. Asymmetrique Indexes
    - Remove the first 3 characters from each (02.Medium becomes Medium), or just change to Low, Medium, High
<br><br>
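
For step 5, a minimal sketch of stripping the numeric prefix from the index labels. Turning the result into an ordered categorical is my own addition, so that Low < Medium < High is preserved; `strip_rank_prefix` is a hypothetical helper and the toy Series is made up.

```python
import numpy as np
import pandas as pd

LEVELS = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)

def strip_rank_prefix(series: pd.Series) -> pd.Series:
    """Turn labels like '02.Medium' into an ordered Low/Medium/High category."""
    # Drop the leading digits and the dot; NaN values pass through unchanged.
    labels = series.str.replace(r"^\d+\.", "", regex=True)
    return labels.astype(LEVELS)

# Toy stand-in for leads['Asymmetrique Activity Index'].
toy_index = pd.Series(["02.Medium", "01.High", "03.Low", np.nan])

cleaned_index = strip_rank_prefix(toy_index)
print(cleaned_index)
```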

--- 
# Analysis
1. Analyzing Sources: Newspaper, Newspaper Article, X Ed Forums, Digital Advertisement, Recommendation <br><br>
2. Look at the Tags, Last Activity, and Last Notable Activity for patterns
3. Patterns with the Visits, Time, Views, etc on Conversion
    - Look at the Bayesian Statistical Analysis approach
    - Consider priors, eliminate data with dependencies
    - Regression? 
4. ML Model? 

