# Business Understanding

## Overview of The Business Problem
- **Our client is an IT educational institute. They have reached out to us with the following:**

    - **IT jobs and technologies keep evolving quickly. This makes our field one of the most interesting out there. On the other hand, such fast development confuses our students: they do not know which skills they need to learn for which job.**

    - `Do I need to learn C++ to be a Data Scientist?`
    - `Do DevOps and System admins use the same technologies?` 
    - `I really like JavaScript; can I use it in Data Analytics?`

## Business Objective
- **Develop a data-driven solution that helps our students answer such questions.**
- **Understand the relationships between the jobs and the technologies.**

## KPIs
- **Higher enrollment rate due to higher certainty** 
- **Decrease in drop-out rate**
- **Time saved for the academic advisors**

## Frame the Problem
- **We have a labeled dataset in which the label, `DevType`, can take multiple values.**
- **This is a supervised multiclass classification task.**
- **There is no continuous flow of data coming into the system, so we are going to use batch learning.**

## Performance Measure
- **Since we are dealing with a classification problem, we can use accuracy, the confusion matrix, precision, recall, and F1-score.**
- **In case of an imbalanced dataset, accuracy won't be a good option, because a model can score well simply by favoring the majority class.**
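
To illustrate why accuracy can mislead on an imbalanced dataset, here is a minimal sketch in plain Python. The labels are invented toy values, not survey data, and the helper functions `accuracy` and `precision_recall_f1` are hypothetical names written only for this example. A model that always predicts the majority class reaches 90% accuracy while never finding the minority class.

```python
# Toy labels, invented for illustration: 9 majority-class samples, 1 minority.
y_true = ["Back-end"] * 9 + ["Data scientist"]
# A degenerate "model" that always predicts the majority class.
y_pred = ["Back-end"] * 10


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


acc = accuracy(y_true, y_pred)
minority_scores = precision_recall_f1(y_true, y_pred, "Data scientist")
print(acc)              # 0.9
print(minority_scores)  # (0.0, 0.0, 0.0)
```

Per-class precision, recall, and F1 expose the failure that the single accuracy number hides.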

## Data Source
- **We will work with the Stack Overflow Developer Survey dataset of 2022.**

# Data Understanding

In [1]:
# Constants
DATA_PATH = '../Data/Raw/survey_results_public2022.csv'

In [2]:
# Load packages
import numpy as np
import pandas as pd
import logging
pd.options.display.max_rows = 10000
pd.options.display.max_columns = 10000

### Functions

In [15]:
def print_unique_values(df, columns):
    """
    Print the five most frequent values of each column and its number of unique values.
    NaN counts as a unique value but is not listed among the frequent values.

    Args:
        df (DataFrame): DataFrame containing the columns to inspect.
        columns (list): List of column names to loop through.

    Returns:
        None
    """
    for col in columns:
        value_counts = df[col].value_counts().head(5)
        unique_count = len(df[col].unique())
        print(f"Unique values of {col}:\nNo. of Unique values: {unique_count}\n{value_counts}' \n")

_________

In [4]:
# Read data and print shape
raw_df = pd.read_csv(DATA_PATH)
raw_df.shape

(73268, 79)

- The dataset consists of 73,268 rows and 79 columns.

In [5]:
raw_df.columns

Index(['ResponseId', 'MainBranch', 'Employment', 'RemoteWork',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       'LearnCodeCoursesCert', 'YearsCode', 'YearsCodePro', 'DevType',
       'OrgSize', 'PurchaseInfluence', 'BuyNewTool', 'Country', 'Currency',
       'CompTotal', 'CompFreq', 'LanguageHaveWorkedWith',
       'LanguageWantToWorkWith', 'DatabaseHaveWorkedWith',
       'DatabaseWantToWorkWith', 'PlatformHaveWorkedWith',
       'PlatformWantToWorkWith', 'WebframeHaveWorkedWith',
       'WebframeWantToWorkWith', 'MiscTechHaveWorkedWith',
       'MiscTechWantToWorkWith', 'ToolsTechHaveWorkedWith',
       'ToolsTechWantToWorkWith', 'NEWCollabToolsHaveWorkedWith',
       'NEWCollabToolsWantToWorkWith', 'OpSysProfessional use',
       'OpSysPersonal use', 'VersionControlSystem', 'VCInteraction',
       'VCHostingPersonal use', 'VCHostingProfessional use',
       'OfficeStackAsyncHaveWorkedWith', 'OfficeStackAsyncWantToWorkWith',
       'OfficeStackSyncHaveWork

In [8]:
# Display a random row
raw_df.sample(1).iloc[0]

ResponseId                                                                    32125
MainBranch                                           I am a developer by profession
Employment                                                      Employed, full-time
RemoteWork                                     Hybrid (some remote, some in-person)
CodingActivities                                                              Hobby
EdLevel                                Bachelor’s degree (B.A., B.S., B.Eng., etc.)
LearnCode                         Books / Physical media;Other online resources ...
LearnCodeOnline                   Technical documentation;Blogs;Stack Overflow;H...
LearnCodeCoursesCert                                                    Pluralsight
YearsCode                                                                        12
YearsCodePro                                                                      7
DevType                                                       Developer, ful

In [6]:
# Print the general information of the data frame 
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73268 entries, 0 to 73267
Data columns (total 79 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ResponseId                      73268 non-null  int64  
 1   MainBranch                      73268 non-null  object 
 2   Employment                      71709 non-null  object 
 3   RemoteWork                      58958 non-null  object 
 4   CodingActivities                58899 non-null  object 
 5   EdLevel                         71571 non-null  object 
 6   LearnCode                       71580 non-null  object 
 7   LearnCodeOnline                 50685 non-null  object 
 8   LearnCodeCoursesCert            29389 non-null  object 
 9   YearsCode                       71331 non-null  object 
 10  YearsCodePro                    51833 non-null  object 
 11  DevType                         61302 non-null  object 
 12  OrgSize                         

- **Both `YearsCode` and `YearsCodePro` are of object type instead of int type; we need to figure out why.**
- **Both `VCHostingPersonal use` and `VCHostingProfessional use` are of float type and have zero non-null values; it seems the data for these two columns was not collected properly.**
- **Due to the nature of the survey, the dataset has a lot of missing values in most columns.**
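
The observations about missing values can be quantified directly. The sketch below uses a tiny stand-in frame, `demo_df`, with invented rows (it is not `raw_df`) to show one way to rank columns by their share of missing values and to list columns that hold no answers at all.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with invented rows, mimicking the patterns in raw_df.info().
demo_df = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "DevType": ["Developer, back-end", None, "Developer, full-stack", None],
    "VCHostingPersonal use": [np.nan] * 4,
})

# Share of missing values per column, highest first.
missing_share = demo_df.isna().mean().sort_values(ascending=False)

# Columns in which every single value is missing.
empty_cols = demo_df.columns[demo_df.isna().all()].tolist()

print(missing_share)
print(empty_cols)
```

Running the same two expressions on `raw_df` should flag the two `VCHosting...` columns as entirely empty, in line with the `info()` output above.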

In [12]:
# Investigate the questionable object-typed columns
questionable_cols = ['YearsCodePro', 'YearsCode']

for col in questionable_cols: 
    print(col)
    print(raw_df[col].unique().tolist())
    print('--------------------------')
    print()

YearsCodePro
[nan, '5', '17', '3', '6', '30', '2', '10', '15', '4', '22', '20', '40', '9', '14', '21', '7', '18', '25', '8', '12', '45', '1', '19', '28', '24', '11', '23', 'Less than 1 year', '32', '27', '16', '44', '26', '37', '46', '13', '31', '39', '34', '38', '35', '29', '42', '36', '33', '43', '41', '48', '50', 'More than 50 years', '47', '49']
--------------------------

YearsCode
[nan, '14', '20', '8', '15', '3', '1', '6', '37', '5', '12', '22', '11', '4', '7', '13', '36', '2', '25', '10', '40', '16', '27', '24', '19', '9', '17', '18', '26', 'More than 50 years', '29', '30', '32', 'Less than 1 year', '48', '45', '38', '39', '28', '23', '43', '21', '41', '35', '50', '33', '31', '34', '46', '44', '42', '47', '49']
--------------------------
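
The output above explains the object dtype: both columns mix numeric strings with the text answers `'Less than 1 year'` and `'More than 50 years'`. A minimal sketch of one possible conversion follows. The replacement values `0` and `51` are an assumption made for illustration (the survey does not define them), and `years_to_numeric` is a hypothetical helper name.

```python
import pandas as pd

# Assumed stand-ins for the two text answers; 0 and 51 are a modelling choice,
# not values defined by the survey.
TEXT_TO_YEARS = {"Less than 1 year": "0", "More than 50 years": "51"}


def years_to_numeric(series):
    """Replace the two text answers, then convert the column to numbers.

    Missing answers stay missing (NaN), so the result is a float column.
    """
    return pd.to_numeric(series.replace(TEXT_TO_YEARS))


# Invented sample values covering each case seen in the output above.
sample = pd.Series(["5", "Less than 1 year", None, "More than 50 years"])
converted = years_to_numeric(sample)
print(converted.tolist())
```

Because `pd.to_numeric` raises on unexpected strings by default, any text answer missing from the mapping would surface as an error rather than being silently dropped.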



In [13]:
# Create lists of the categorical and numerical columns
cat_cols = list(raw_df.select_dtypes(include=['object']).columns)
num_cols = list(raw_df.select_dtypes(exclude=['object']).columns)

In [11]:
# Print the most frequent values of each categorical column and its unique-value count
print_unique_values(raw_df, cat_cols)

Unique values of MainBranch:
No. of Unique values: 6
I am a developer by profession                                                   53507
I am learning to code                                                             6309
I am not primarily a developer, but I write code sometimes as part of my work     5794
I code primarily as a hobby                                                       4865
None of these                                                                     1497
I used to be a developer by profession, but no longer am                          1296
Name: MainBranch, dtype: int64' 

Unique values of Employment:
No. of Unique values: 104
Employed, full-time                                                         42962
Student, full-time                                                           6756
Independent contractor, freelancer, or self-employed                         4978
Employed, full-time;Independent contractor, freelancer, or self-employed     3486
Not empl

- **Changes that need to be made:**
    - **Change the text values in `YearsCode` and `YearsCodePro` to numerical values.**
    - **Add `VCHostingPersonal use` and `VCHostingProfessional use` to the list of features that are not useful, until further notice.**
    - **The features that contain multiple answers in each row separated by `;` need to be converted to a format we can easily manipulate.**
    - **The skills inside the features that end with `HaveWorkedWith` or `WantToWorkWith` need to be combined.**
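
The last two changes can be sketched as follows, on a small invented frame (`demo_df` is not `raw_df`). `split_answers` is a hypothetical helper, and reading "combined" as the per-respondent union of both columns is only one possible interpretation. `Series.str.get_dummies(sep=";")` is an existing pandas method that gives a one-hot view of a multi-answer column.

```python
import pandas as pd

# Invented rows using two real column names from the survey.
demo_df = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python;SQL", "JavaScript", None],
    "LanguageWantToWorkWith": ["Rust;Python", None, "Go"],
})


def split_answers(series):
    """Turn 'a;b;c' strings into lists; missing answers become empty lists."""
    return series.fillna("").apply(lambda s: s.split(";") if s else [])


have = split_answers(demo_df["LanguageHaveWorkedWith"])
want = split_answers(demo_df["LanguageWantToWorkWith"])

# One reading of "combined": the union of both answers per respondent.
combined = [sorted(set(h) | set(w)) for h, w in zip(have, want)]

# Alternative format: one indicator column per skill.
one_hot = demo_df["LanguageHaveWorkedWith"].str.get_dummies(sep=";")

print(combined)
print(one_hot)
```

Lists are convenient for counting and filtering, while the indicator columns are closer to what most classifiers expect as input.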

In [11]:
# Get summary statistics for the numerical columns
raw_df.describe()

| | ResponseId | CompTotal | VCHostingPersonal use | VCHostingProfessional use | WorkExp | ConvertedCompYearly |
|---|---|---|---|---|---|---|
| count | 73268.0 | 38422.0 | 0.0 | 0.0 | 36769.0 | 38071.0 |
| mean | 36634.5 | 2.342434e+52 | NaN | NaN | 10.242378 | 170761.3 |
| std | 21150.794099 | 4.591478e+54 | NaN | NaN | 8.70685 | 781413.2 |
| min | 1.0 | 0.0 | NaN | NaN | 0.0 | 1.0 |
| 25% | 18317.75 | 30000.0 | NaN | NaN | 4.0 | 35832.0 |
| 50% | 36634.5 | 77500.0 | NaN | NaN | 8.0 | 67845.0 |
| 75% | 54951.25 | 154000.0 | NaN | NaN | 15.0 | 120000.0 |
| max | 73268.0 | 9e+56 | NaN | NaN | 50.0 | 50000000.0 |


- **The survey provides no information about `ConvertedCompYearly` or how it was calculated.**

In [16]:
# Print the most frequent values of each numerical column and its unique-value count
print_unique_values(raw_df, num_cols)

Unique values of ResponseId:
No. of Unique values: 73268
1        1
48844    1
48850    1
48849    1
48848    1
Name: ResponseId, dtype: int64' 

Unique values of CompTotal:
No. of Unique values: 3180
100000.0    980
150000.0    789
60000.0     752
120000.0    745
50000.0     733
Name: CompTotal, dtype: int64' 

Unique values of VCHostingPersonal use:
No. of Unique values: 1
Series([], Name: VCHostingPersonal use, dtype: int64)' 

Unique values of VCHostingProfessional use:
No. of Unique values: 1
Series([], Name: VCHostingProfessional use, dtype: int64)' 

Unique values of WorkExp:
No. of Unique values: 52
5.0    3029
3.0    2880
4.0    2713
2.0    2619
1.0    2469
Name: WorkExp, dtype: int64' 

Unique values of ConvertedCompYearly:
No. of Unique values: 7910
150000.0    393
200000.0    362
120000.0    341
63986.0     304
100000.0    279
Name: ConvertedCompYearly, dtype: int64' 



- **Go through the survey schema to identify each feature and the question behind it, then judge its importance for our business case.**
    - **Features that are not useful:**
        - `ResponseId`, `SurveyEase`, `SurveyLength`
        - `TrueFalse_1` to `TrueFalse_3`, `Frequency_1` to `Frequency_3`, `Knowledge_1` to `Knowledge_7`
        - `Onboarding`, `TimeSearching`, `TimeAnswering`, `ICorPM`, `TBranch`
        - `Trans`, `Sexuality`, `Ethnicity`, `Accessibility`, `MentalHealth`, `Age`, `Gender`, `Blockchain`
        - `SOComm`, `NEWSOSites`, `SOVisitFreq`, `SOPartFreq`, `SOAccount`, `BuyNewTool`, `PurchaseInfluence`
        - `OfficeStackAsyncHaveWorkedWith`, `OfficeStackAsyncWantToWorkWith`, `OfficeStackSyncHaveWorkedWith`, `OfficeStackSyncWantToWorkWith`
        - `VCInteraction`, `VCHostingPersonal use`, `VCHostingProfessional use`
        - `OpSysProfessional use`, `OpSysPersonal use`

        
        
    - **Potentially useful features:**
        - `Employment`, `RemoteWork`
        - `MainBranch`, `CodingActivities`, `ProfessionalTech`
        - `LearnCode`, `LearnCodeOnline`, `LearnCodeCoursesCert`
        - `WorkExp`, `YearsCode`, `YearsCodePro`, `EdLevel`
        - `OrgSize`, `Country`
        - `ConvertedCompYearly`, `Currency`, `CompTotal`, `CompFreq`
        
    - **Core features:**
        - `DevType`
        - `VersionControlSystem`
        - `LanguageHaveWorkedWith`, `LanguageWantToWorkWith`
        - `DatabaseHaveWorkedWith`, `DatabaseWantToWorkWith`
        - `PlatformHaveWorkedWith`, `PlatformWantToWorkWith`
        - `WebframeHaveWorkedWith`, `WebframeWantToWorkWith`
        - `MiscTechHaveWorkedWith`, `MiscTechWantToWorkWith`
        - `ToolsTechHaveWorkedWith`, `ToolsTechWantToWorkWith`
        - `NEWCollabToolsHaveWorkedWith`, `NEWCollabToolsWantToWorkWith`
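
The grouping above can be turned into code along these lines. This is a sketch on an empty stand-in frame that carries only a handful of the survey's column names; `DROP_COLS` is deliberately abbreviated, and both it and `SKILL_SUFFIXES` are hypothetical names chosen for this example.

```python
import pandas as pd

# Empty stand-in frame carrying a few of the survey's column names.
demo_df = pd.DataFrame(columns=[
    "ResponseId", "DevType", "LanguageHaveWorkedWith", "LanguageWantToWorkWith",
    "OfficeStackSyncHaveWorkedWith", "VCInteraction", "SurveyEase", "Country",
])

# Abbreviated drop list; the full list is the first group of bullets above.
DROP_COLS = [
    "ResponseId", "SurveyEase", "SurveyLength",
    "OfficeStackSyncHaveWorkedWith", "VCInteraction",
]

# errors="ignore" skips names that are absent from the frame (here: SurveyLength).
trimmed_df = demo_df.drop(columns=DROP_COLS, errors="ignore")

# Select the skill columns by suffix instead of typing each name.
SKILL_SUFFIXES = ("HaveWorkedWith", "WantToWorkWith")
skill_cols = [c for c in trimmed_df.columns if c.endswith(SKILL_SUFFIXES)]

print(list(trimmed_df.columns))
print(skill_cols)
```

Dropping first matters here: the `OfficeStack...` columns share the skill suffixes, so selecting by suffix before the drop would pull them into the core features.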