# Business Understanding

## Overview of The Business Problem
- **Our client is an IT educational institute. They have reached out to us with the following:**

    - **IT jobs and technologies keep evolving quickly. This makes our field one of the most interesting out there. On the other hand, such fast development confuses our students: they do not know which skills they need to learn for which job.**

    - `Do I need to learn C++ to be a Data Scientist?`
    - `Do DevOps and System admins use the same technologies?` 
    - `I really like JavaScript; can I use it in Data Analytics?`

## Business Objective
- **Develop a Data-Driven solution for our students to answer such questions.**
- **Understand the relationships between the jobs and the technologies.**

## KPIs
- **Higher enrollment rate due to higher certainty** 
- **Decrease in drop-out rate**
- **Time saved for the academic advisors**

## How will your solution be used?

## Current Solution

## Frame the Problem
- How should you frame this problem (supervised/unsupervised, online/offline, etc.)

## Performance Measure
Since we are dealing with a classification problem, we can use **Accuracy, Confusion matrix, Precision, Recall, F1-score**. <br>
In case of an imbalanced dataset, **Accuracy** won't be a good option, because a model that always predicts the majority class can still score high. <br>
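
To make the point about imbalance concrete, here is a minimal sketch in plain Python. The labels are invented for illustration (95 negatives, 5 positives), and `binary_metrics` is a hypothetical helper, not part of this project's code. It scores a model that always predicts the majority class.

```python
# Toy labels, invented for illustration: 1 = positive class, 0 = negative class.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Undefined ratios are set to 0.0 here (a convention chosen for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy comes out high even though the model never finds a single positive example, while recall and F1 are zero. This is why the other measures matter when the classes are imbalanced.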

## Is the performance measure aligned with the business objective?

## What would be the minimum performance needed to reach the business objective?

## Data Source
- **We will work with the Stack Overflow Developer Survey dataset of 2022.**

- What are comparable problems? Can you reuse experience or tools?
- Is human expertise available?
- How would you solve the problem manually?
- List the assumptions you or others have made so far.
- Verify assumptions if possible.

# Data Understanding

In [9]:
# Constants
DATA_PATH = '../Data/Raw/survey_results_public2022.csv'
# DATA_PATH1 = '../Data/Raw/survey_results_public2021.csv'
# DATA_PATH2 = '../Data/Raw/survey_results_public2023.csv'

In [10]:
# Load packages
import numpy as np
import pandas as pd
import logging
pd.options.display.max_rows = 10000
pd.options.display.max_columns = 10000

In [11]:
# Read data and print shape
raw_df = pd.read_csv(DATA_PATH)
raw_df.shape

(73268, 79)

In [12]:
raw_df.columns

Index(['ResponseId', 'MainBranch', 'Employment', 'RemoteWork',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       'LearnCodeCoursesCert', 'YearsCode', 'YearsCodePro', 'DevType',
       'OrgSize', 'PurchaseInfluence', 'BuyNewTool', 'Country', 'Currency',
       'CompTotal', 'CompFreq', 'LanguageHaveWorkedWith',
       'LanguageWantToWorkWith', 'DatabaseHaveWorkedWith',
       'DatabaseWantToWorkWith', 'PlatformHaveWorkedWith',
       'PlatformWantToWorkWith', 'WebframeHaveWorkedWith',
       'WebframeWantToWorkWith', 'MiscTechHaveWorkedWith',
       'MiscTechWantToWorkWith', 'ToolsTechHaveWorkedWith',
       'ToolsTechWantToWorkWith', 'NEWCollabToolsHaveWorkedWith',
       'NEWCollabToolsWantToWorkWith', 'OpSysProfessional use',
       'OpSysPersonal use', 'VersionControlSystem', 'VCInteraction',
       'VCHostingPersonal use', 'VCHostingProfessional use',
       'OfficeStackAsyncHaveWorkedWith', 'OfficeStackAsyncWantToWorkWith',
       'OfficeStackSyncHaveWork

In [13]:
# Display random row 
raw_df.sample(1).iloc[0]

ResponseId                                                                     9384
MainBranch                                           I am a developer by profession
Employment                                                      Employed, part-time
RemoteWork                                     Hybrid (some remote, some in-person)
CodingActivities                                       I don’t code outside of work
EdLevel                             Master’s degree (M.A., M.S., M.Eng., MBA, etc.)
LearnCode                         Books / Physical media;School (i.e., Universit...
LearnCodeOnline                                                                 NaN
LearnCodeCoursesCert                                                            NaN
YearsCode                                                                        17
YearsCodePro                                                                      9
DevType                                   Developer, full-stack;Developer, b

In [14]:
# Print the general information of the data frame 
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73268 entries, 0 to 73267
Data columns (total 79 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ResponseId                      73268 non-null  int64  
 1   MainBranch                      73268 non-null  object 
 2   Employment                      71709 non-null  object 
 3   RemoteWork                      58958 non-null  object 
 4   CodingActivities                58899 non-null  object 
 5   EdLevel                         71571 non-null  object 
 6   LearnCode                       71580 non-null  object 
 7   LearnCodeOnline                 50685 non-null  object 
 8   LearnCodeCoursesCert            29389 non-null  object 
 9   YearsCode                       71331 non-null  object 
 10  YearsCodePro                    51833 non-null  object 
 11  DevType                         61302 non-null  object 
 12  OrgSize                         

In [15]:
# Check for duplicates
raw_df.duplicated().value_counts()

False    73268
dtype: int64

In [16]:
# Create lists of the categorical and numerical columns
cat_cols = list(raw_df.select_dtypes(include=['object']).columns)
num_cols = list(raw_df.select_dtypes(exclude=['object']).columns)

In [17]:
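def print_value_counts(df, columns):
    """Print the number of unique values and the 50 most frequent values of each column."""
    for col in columns:
        value_counts = df[col].value_counts()[:50]
        # unique() counts NaN as a value, so this can be one more than nunique()
        unique_count = len(df[col].unique())
        print(f"Value counts of {col}:\nNo. of unique values: {unique_count}\n{value_counts}\n")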

In [None]:
# Explore the categories and no. of unique values of each categorical column
print_value_counts(raw_df, cat_cols)
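
Several columns, such as `DevType` and `LanguageHaveWorkedWith`, hold multiple answers separated by `;` (visible in the sample row above). Raw value counts therefore count whole combinations rather than individual options. The sketch below shows one possible way to split them and count job/language pairs. The sample rows are invented, and `explode_multiselect` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Tiny invented sample that mimics the survey's ';'-separated answers.
sample_df = pd.DataFrame({
    "DevType": [
        "Developer, full-stack;Developer, back-end",
        "Data scientist or machine learning specialist",
        None,
    ],
    "LanguageHaveWorkedWith": ["JavaScript;Python", "Python;SQL", "C++"],
})


def explode_multiselect(df, column, sep=";"):
    """Return one row per selected option in a multi-answer column."""
    out = df.dropna(subset=[column]).copy()
    out[column] = out[column].str.split(sep)
    return out.explode(column).reset_index(drop=True)


# One row per (job, language) pair, then count how often each pair occurs.
pairs = explode_multiselect(
    explode_multiselect(sample_df, "DevType"), "LanguageHaveWorkedWith"
)
job_language_counts = pd.crosstab(pairs["DevType"], pairs["LanguageHaveWorkedWith"])
print(job_language_counts)
```

Rows with a missing value in either column are dropped in this sketch. Whether that is appropriate for the real data is a decision for the data preparation step.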

In [None]:
# Get stats for the numerical columns
raw_df.describe()

In [None]:
# Investigate the questionable object columns
questionable_cols = ['YearsCodePro', 'YearsCode']

for col in questionable_cols: 
    print(col)
    print(raw_df[col].unique().tolist())
    print('--------------------------')
    print()
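
`YearsCode` and `YearsCodePro` are stored as `object` even though they describe numbers of years, which suggests that some entries are text. A minimal sketch of a possible conversion follows. It assumes the text entries are `'Less than 1 year'` and `'More than 50 years'` (confirm this against the unique values printed by the cell above). The replacement values 0 and 51 are arbitrary choices made here, and `years_to_numeric` is a hypothetical helper.

```python
import pandas as pd

# Assumed text labels; confirm them against the unique values printed above.
# The numeric stand-ins (0 and 51) are arbitrary choices for this sketch.
YEARS_REPLACEMENTS = {"Less than 1 year": 0, "More than 50 years": 51}


def years_to_numeric(series):
    """Map the assumed text labels to numbers, then convert to a numeric dtype."""
    mapped = series.map(lambda value: YEARS_REPLACEMENTS.get(value, value))
    # Anything that still cannot be parsed becomes NaN instead of raising.
    return pd.to_numeric(mapped, errors="coerce")


# Invented sample values in the same shape as YearsCode / YearsCodePro.
sample = pd.Series(["17", "Less than 1 year", None, "More than 50 years", "9"])
converted = years_to_numeric(sample)
print(converted.tolist())
```

With `errors="coerce"`, any unexpected label turns into a missing value silently, so it is worth comparing the number of missing values before and after the conversion.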

In [None]:
def print_unique_values(encoded_df, columns):
    """
    Print the unique values for each categorical column in the DataFrame.

    Args:
        encoded_df (DataFrame): Encoded DataFrame containing categorical columns.
        columns (list): List of column names to loop through.

    Returns:
        None
    """
    for col in columns:
        unique_values = encoded_df[col].unique()
        print(f"Unique values of '{col}': {unique_values}\n")