# **Data Wrangling**

## Repository

Check out the repository on [GitHub](https://github.com/FaiLuReH3Ro/data-wrangling-py) for more details.


## Dataset Used

[Stack Overflow Survey Data 2024 Subset](https://www.kaggle.com/datasets/failureh3ro/stack-overflow-survey-data-2024-subset/data)

## Objectives

* Identify duplicate rows in the dataset
* Remove duplicate rows and verify the removal 
* Find columns with missing values
* Impute the missing values
* Normalize the data in selected columns

## Download and Import Libraries

In [1]:
# Run this cell if the libraries are not installed yet
# Uncomment the lines below to install
# %pip install pandas
# %pip install numpy

In [2]:
# Importing the pandas and numpy libraries
import pandas as pd
import numpy as np

# Suppress warnings
# Comment out the two lines below to see the warnings
import warnings
warnings.filterwarnings("ignore")

## Loading the Data

In [3]:
# Read the CSV file into a dataframe
url = '/kaggle/input/stack-overflow-survey-data-2024-subset/survey_data.csv'
data = pd.read_csv(url)

# Options to display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Display the first five rows
data.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,TechDoc,YearsCode,YearsCodePro,DevType,OrgSize,PurchaseInfluence,BuyNewTool,BuildvsBuy,TechEndorse,Country,Currency,CompTotal,LanguageHaveWorkedWith,LanguageWantToWorkWith,LanguageAdmired,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,DatabaseAdmired,PlatformHaveWorkedWith,PlatformWantToWorkWith,PlatformAdmired,WebframeHaveWorkedWith,WebframeWantToWorkWith,WebframeAdmired,EmbeddedHaveWorkedWith,EmbeddedWantToWorkWith,EmbeddedAdmired,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,MiscTechAdmired,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,ToolsTechAdmired,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,NEWCollabToolsAdmired,OpSysPersonal use,OpSysProfessional use,OfficeStackAsyncHaveWorkedWith,OfficeStackAsyncWantToWorkWith,OfficeStackAsyncAdmired,OfficeStackSyncHaveWorkedWith,OfficeStackSyncWantToWorkWith,OfficeStackSyncAdmired,AISearchDevHaveWorkedWith,AISearchDevWantToWorkWith,AISearchDevAdmired,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOHow,SOComm,AISelect,AISent,AIBen,AIAcc,AIComplex,AIToolCurrently Using,AIToolInterested in Using,AIToolNot interested in Using,AINextMuch more integrated,AINextNo change,AINextMore integrated,AINextLess integrated,AINextMuch less integrated,AIThreat,AIEthics,AIChallenges,TBranch,ICorPM,WorkExp,Knowledge_1,Knowledge_2,Knowledge_3,Knowledge_4,Knowledge_5,Knowledge_6,Knowledge_7,Knowledge_8,Knowledge_9,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,Frustration,ProfessionalTech,ProfessionalCloud,ProfessionalQuestion,Industry,JobSatPoints_1,JobSatPoints_4,JobSatPoints_5,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,,,,,,,,,,United States of America,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,I have never visited Stack Overflow or the Sta...,,,,,,Yes,Very favorable,Increase productivity,,,,,,,,,,,,,,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,API document(s) and/or SDK document(s);User gu...,20.0,17.0,"Developer, full-stack",,,,,,United Kingdom of Great Britain and Northern I...,,,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Dynamodb;MongoDB;PostgreSQL,PostgreSQL,PostgreSQL,Amazon Web Services (AWS);Heroku;Netlify,Amazon Web Services (AWS);Heroku;Netlify,Amazon Web Services (AWS);Heroku;Netlify,Express;Next.js;Node.js;React,Express;Htmx;Node.js;React;Remix,Express;Node.js;React,,,,,,,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,PyCharm;Visual Studio Code;WebStorm,PyCharm;Visual Studio Code;WebStorm,PyCharm;Visual Studio Code;WebStorm,MacOS;Windows,MacOS,,,,Microsoft Teams;Slack,Slack,Slack,,,,Stack Overflow for Teams (private knowledge sh...,Multiple times per day,Yes,Multiple times per day,Quickly finding code solutions;Finding reliabl...,"Yes, definitely","No, and I don't plan to",,,,,,,,,,,,,,,,Yes,Individual contributor,17.0,Agree,Disagree,Agree,Agree,Agree,Neither agree nor disagree,Disagree,Agree,Agree,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,API document(s) and/or SDK document(s);User gu...,37.0,27.0,Developer Experience,,,,,,United Kingdom of Great Britain and Northern I...,,,C#,C#,C#,Firebase Realtime Database,Firebase Realtime Database,Firebase Realtime Database,Google Cloud,Google Cloud,Google Cloud,ASP.NET CORE,ASP.NET CORE,ASP.NET CORE,Rasberry Pi,Rasberry Pi,Rasberry Pi,.NET (5+) ;.NET Framework (1.0 - 4.8);.NET MAUI,.NET (5+) ;.NET Framework (1.0 - 4.8);.NET MAUI,.NET (5+) ;.NET Framework (1.0 - 4.8);.NET MAUI,MSBuild,MSBuild,MSBuild,Visual Studio,Visual Studio,Visual Studio,Windows,Windows,,,,Google Chat;Google Meet;Microsoft Teams;Zoom,Google Chat;Google Meet;Zoom,Google Chat;Google Meet;Zoom,,,,Stack Overflow;Stack Exchange;Stack Overflow B...,Multiple times per day,Yes,Multiple times per day,Quickly finding code solutions;Finding reliabl...,"Yes, definitely","No, and I don't plan to",,,,,,,,,,,,,,,,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,,4.0,,"Developer, full-stack",,,,,,Canada,,,C;C++;HTML/CSS;Java;JavaScript;PHP;PowerShell;...,HTML/CSS;Java;JavaScript;PowerShell;Python;SQL...,HTML/CSS;Java;JavaScript;PowerShell;Python;SQL...,MongoDB;MySQL;PostgreSQL;SQLite,MongoDB;MySQL;PostgreSQL,MongoDB;MySQL;PostgreSQL,Amazon Web Services (AWS);Fly.io;Heroku,Amazon Web Services (AWS);Vercel,Amazon Web Services (AWS),jQuery;Next.js;Node.js;React;WordPress,jQuery;Next.js;Node.js;React,jQuery;Next.js;Node.js;React,Rasberry Pi,,,NumPy;Pandas;Ruff;TensorFlow,,,Docker;npm;Pip,Docker;Kubernetes;npm,Docker;npm,,,,,,,,,,,,,,,Stack Overflow,Daily or almost daily,No,,Quickly finding code solutions,"No, not really",Yes,Very favorable,Increase productivity;Greater efficiency;Impro...,Somewhat trust,Bad at handling complex tasks,Learning about a codebase;Project planning;Wri...,Testing code;Committing and reviewing code;Pre...,,Learning about a codebase;Project planning;Wri...,,,,,No,Circulating misinformation or disinformation;M...,Don’t trust the output or answers,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,API document(s) and/or SDK document(s);User gu...,9.0,,"Developer, full-stack",,,,,,Norway,,,C++;HTML/CSS;JavaScript;Lua;Python;Rust,C++;HTML/CSS;JavaScript;Lua;Python,C++;HTML/CSS;JavaScript;Lua;Python,PostgreSQL;SQLite,PostgreSQL;SQLite,PostgreSQL;SQLite,,,,,,,CMake;Cargo;Rasberry Pi,CMake;Rasberry Pi,CMake;Rasberry Pi,,,,APT;Make;npm,APT;Make,APT;Make,Vim,Vim,Vim,Other (please specify):,,GitHub Discussions;Markdown File;Obsidian;Stac...,GitHub Discussions;Markdown File;Obsidian,GitHub Discussions;Markdown File;Obsidian,Discord;Whatsapp,Discord;Whatsapp,Discord;Whatsapp,,,,Stack Overflow for Teams (private knowledge sh...,Multiple times per day,Yes,Multiple times per day,Quickly finding code solutions;Engage with com...,"Yes, definitely","No, and I don't plan to",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Too short,Easy,,


## Handling Duplicate Rows

### Finding Duplicates

In [4]:
# Finding the number of duplicate rows
dup_rows = data[data.duplicated()]
num_dups = dup_rows.shape[0]
print(f'There are {num_dups} duplicate rows')

There are 0 duplicate rows


### Removing Duplicates

In [5]:
# Removing the duplicate rows and verifying
df = data.drop_duplicates()
num_dups = df[df.duplicated()].shape[0]
print(f'There are {num_dups} duplicate rows')

There are 0 duplicate rows
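
A full-row check can report zero duplicates when an identifier column such as `ResponseId` differs on every row, even if the answers themselves repeat. The sketch below uses a made-up toy frame (its values are illustrative, not taken from the survey) to show how the `subset` argument of `duplicated` can leave the identifier out of the comparison. Whether that stricter check is appropriate for this dataset is an assumption, not something the results above establish.

```python
import pandas as pd

# Toy data: rows 0 and 2 match on everything except the ID
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "Age": ["18-24 years old", "25-34 years old", "18-24 years old"],
    "Country": ["Canada", "Norway", "Canada"],
})

# Full-row check: the unique ID hides the repeated answers
full_row_dups = int(toy.duplicated().sum())

# Compare on every column except the identifier
answer_cols = toy.columns.drop("ResponseId").tolist()
answer_dups = int(toy.duplicated(subset=answer_cols).sum())

print(full_row_dups, answer_dups)
```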


## Handling Missing Values

### Finding Missing Values

In [6]:
# info() shows the non-null count for each column
# The dataframe has 65437 rows in total
# Any column with a non-null count below that total has missing values
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65437 entries, 0 to 65436
Data columns (total 114 columns):
 #    Column                          Non-Null Count  Dtype  
---   ------                          --------------  -----  
 0    ResponseId                      65437 non-null  int64  
 1    MainBranch                      65437 non-null  object 
 2    Age                             65437 non-null  object 
 3    Employment                      65437 non-null  object 
 4    RemoteWork                      54806 non-null  object 
 5    Check                           65437 non-null  object 
 6    CodingActivities                54466 non-null  object 
 7    EdLevel                         60784 non-null  object 
 8    LearnCode                       60488 non-null  object 
 9    LearnCodeOnline                 49237 non-null  object 
 10   TechDoc                         40897 non-null  object 
 11   YearsCode                       59869 non-null  object 
 12   YearsCodePro    

In [7]:
# Find the top 5 columns with the most missing values
df.isnull().sum().sort_values(ascending = False).head()

AINextMuch less integrated    64289
AINextLess integrated         63082
AINextNo change               52939
AINextMuch more integrated    51999
EmbeddedAdmired               48704
dtype: int64
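
Raw counts are easier to judge as a share of all rows. As a minimal sketch on a small invented frame (the values are placeholders), the mean of the boolean mask returned by `isnull()` gives the fraction missing per column:

```python
import numpy as np
import pandas as pd

# Toy frame with a different amount of missing data in each column
toy = pd.DataFrame({
    "RemoteWork": ["Remote", np.nan, "Hybrid", np.nan],
    "EdLevel": ["Bachelor", "Master", np.nan, "Bachelor"],
    "Age": ["18-24", "25-34", "35-44", "45-54"],
})

# isnull() yields booleans, so the column mean is the fraction missing
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

The same expression should work on `df` to rank the survey columns by percentage missing.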

### Dropping Rows Based on Objective

Our original goal is to find data about technologies. In that case, the most important columns are `LanguageHaveWorkedWith`, `LanguageWantToWorkWith`, `DatabaseHaveWorkedWith`, `DatabaseWantToWorkWith`, `PlatformHaveWorkedWith`, `PlatformWantToWorkWith`, `WebframeHaveWorkedWith`, `WebframeWantToWorkWith`, `ToolsTechHaveWorkedWith`, `ToolsTechWantToWorkWith`, `NEWCollabToolsHaveWorkedWith`, and `NEWCollabToolsWantToWorkWith`.

To keep the data as accurate as possible, it is best not to replace the missing values in these columns with the most frequent value, because that would skew the data. Here, I will find how many missing values are in these columns.

In [8]:
target_columns = ['LanguageHaveWorkedWith', 'LanguageWantToWorkWith', 'DatabaseHaveWorkedWith', 
                  'DatabaseWantToWorkWith', 'PlatformHaveWorkedWith', 'PlatformWantToWorkWith', 
                  'WebframeHaveWorkedWith', 'WebframeWantToWorkWith', 'ToolsTechHaveWorkedWith',
                  'ToolsTechWantToWorkWith', 'NEWCollabToolsHaveWorkedWith', 'NEWCollabToolsWantToWorkWith']

for column in target_columns:
    print(df[column].isnull().value_counts())
    print("")

LanguageHaveWorkedWith
False    59745
True      5692
Name: count, dtype: int64

LanguageWantToWorkWith
False    55752
True      9685
Name: count, dtype: int64

DatabaseHaveWorkedWith
False    50254
True     15183
Name: count, dtype: int64

DatabaseWantToWorkWith
False    42558
True     22879
Name: count, dtype: int64

PlatformHaveWorkedWith
False    42366
True     23071
Name: count, dtype: int64

PlatformWantToWorkWith
False    34532
True     30905
Name: count, dtype: int64

WebframeHaveWorkedWith
False    45161
True     20276
Name: count, dtype: int64

WebframeWantToWorkWith
False    38535
True     26902
Name: count, dtype: int64

ToolsTechHaveWorkedWith
False    52482
True     12955
Name: count, dtype: int64

ToolsTechWantToWorkWith
False    46084
True     19353
Name: count, dtype: int64

NEWCollabToolsHaveWorkedWith
False    57592
True      7845
Name: count, dtype: int64

NEWCollabToolsWantToWorkWith
False    52087
True     13350
Name: count, dtype: int64



To avoid removing an excessive amount of data, I will drop rows based on the column with the fewest NaN values. Based on the counts above, that is the `LanguageHaveWorkedWith` column.

In [9]:
# This will drop rows based only on the LanguageHaveWorkedWith column
df.dropna(subset=['LanguageHaveWorkedWith'], inplace = True)

In [10]:
# Verify the process
df['LanguageHaveWorkedWith'].isnull().value_counts()

LanguageHaveWorkedWith
False    59745
Name: count, dtype: int64
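
The trade-off behind filtering on a single column can be seen on a toy frame (invented values, with only two of the target columns for brevity). This is only a sketch of the reasoning: requiring every listed column to be present removes far more rows than requiring one.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python", "C#", np.nan, "Rust"],
    "DatabaseHaveWorkedWith": ["SQLite", np.nan, np.nan, np.nan],
})

# Rows kept when only one column must be present
kept_one = len(toy.dropna(subset=["LanguageHaveWorkedWith"]))

# Rows kept when every listed column must be present
kept_all = len(toy.dropna(subset=list(toy.columns)))

print(kept_one, kept_all)
```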

### Dropping Columns Based on Objective

There are many columns that I don't need to answer my questions, so I will keep only the relevant ones. This also makes the dataset smaller and easier to load.

In [11]:
# Keep only the relevant columns; .copy() makes df an independent dataframe
# so later column assignments do not act on a slice of the original
df = df[['ResponseId', 'MainBranch', 'Age', 'Employment', 'RemoteWork', 'EdLevel', 'YearsCode',
         'YearsCodePro', 'DevType', 'Country', 'CompTotal', 'LanguageHaveWorkedWith', 
         'LanguageWantToWorkWith', 'DatabaseHaveWorkedWith', 'DatabaseWantToWorkWith', 
         'PlatformHaveWorkedWith', 'PlatformWantToWorkWith', 'WebframeHaveWorkedWith', 
         'WebframeWantToWorkWith', 'ToolsTechHaveWorkedWith', 'ToolsTechWantToWorkWith', 
         'NEWCollabToolsHaveWorkedWith', 'NEWCollabToolsWantToWorkWith', 'WorkExp', 'ConvertedCompYearly',
         'JobSat']].copy()

In [12]:
df.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,EdLevel,YearsCode,YearsCodePro,DevType,Country,CompTotal,LanguageHaveWorkedWith,LanguageWantToWorkWith,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,PlatformHaveWorkedWith,PlatformWantToWorkWith,WebframeHaveWorkedWith,WebframeWantToWorkWith,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,WorkExp,ConvertedCompYearly,JobSat
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",20,17.0,"Developer, full-stack",United Kingdom of Great Britain and Northern I...,,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Dynamodb;MongoDB;PostgreSQL,PostgreSQL,Amazon Web Services (AWS);Heroku;Netlify,Amazon Web Services (AWS);Heroku;Netlify,Express;Next.js;Node.js;React,Express;Htmx;Node.js;React;Remix,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,PyCharm;Visual Studio Code;WebStorm,PyCharm;Visual Studio Code;WebStorm,17.0,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",37,27.0,Developer Experience,United Kingdom of Great Britain and Northern I...,,C#,C#,Firebase Realtime Database,Firebase Realtime Database,Google Cloud,Google Cloud,ASP.NET CORE,ASP.NET CORE,MSBuild,MSBuild,Visual Studio,Visual Studio,,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Some college/university study without earning ...,4,,"Developer, full-stack",Canada,,C;C++;HTML/CSS;Java;JavaScript;PHP;PowerShell;...,HTML/CSS;Java;JavaScript;PowerShell;Python;SQL...,MongoDB;MySQL;PostgreSQL;SQLite,MongoDB;MySQL;PostgreSQL,Amazon Web Services (AWS);Fly.io;Heroku,Amazon Web Services (AWS);Vercel,jQuery;Next.js;Node.js;React;WordPress,jQuery;Next.js;Node.js;React,Docker;npm;Pip,Docker;Kubernetes;npm,,,,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,"Secondary school (e.g. American high school, G...",9,,"Developer, full-stack",Norway,,C++;HTML/CSS;JavaScript;Lua;Python;Rust,C++;HTML/CSS;JavaScript;Lua;Python,PostgreSQL;SQLite,PostgreSQL;SQLite,,,,,APT;Make;npm,APT;Make,Vim,Vim,,,
5,6,I code primarily as a hobby,Under 18 years old,"Student, full-time",,Primary/elementary school,10,,Student,United States of America,,Bash/Shell (all shells);HTML/CSS;Java;JavaScri...,Bash/Shell (all shells);HTML/CSS;Java;JavaScri...,Cloud Firestore,Cloud Firestore,Cloudflare,Cloudflare,Node.js,Node.js,Docker;Homebrew;npm;Pip;pnpm,Homebrew;npm;Pip;pnpm,Nano;Vim;Visual Studio Code;Xcode,Nano;Vim;Visual Studio Code;Xcode,,,


### Imputing Missing Values in Numeric Columns

> Note: Some of the columns are missing more than 50% of their values, so I will not impute values for those columns. I will leave the NaN values in place for now.

#### YearsCode Column

Since the `YearsCode` column has a dtype of 'object', I will convert it to 'float' so that the missing values can be imputed with the average number of years.

In [13]:
# Method 1 - fillna

# Replace the text answers with numbers
# Assigning the result back avoids calling inplace methods on a column selection,
# which recent pandas versions warn about and may not apply to df
df['YearsCode'] = df['YearsCode'].replace({'Less than 1 year': '0', 'More than 50 years': '50'})
df['YearsCode'] = df['YearsCode'].astype(float)

# Replace NaN values with the average, rounded to a whole year
avg_years = round(df['YearsCode'].mean(), 0)
df['YearsCode'] = df['YearsCode'].fillna(avg_years)

In [14]:
# Verify the imputation
df['YearsCode'].isnull().value_counts()

YearsCode
False    59745
Name: count, dtype: int64
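
The convert-then-impute steps can be traced on a toy series. This is a minimal sketch, assuming the only non-numeric answers are the two text labels handled above. `pd.to_numeric` is used here in place of `astype(float)` purely as an illustration.

```python
import numpy as np
import pandas as pd

years = pd.Series(["Less than 1 year", "12", "More than 50 years", np.nan])

# Map the two text answers to numbers, then convert the whole column
text_to_number = {"Less than 1 year": "0", "More than 50 years": "50"}
years_num = pd.to_numeric(years.replace(text_to_number))

# Fill the gap with the rounded mean of the observed values
years_filled = years_num.fillna(round(years_num.mean()))
print(years_filled.tolist())
```

Here the observed values 0, 12, and 50 average to about 20.67, so the missing entry is filled with 21.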

#### YearsCodePro Column

The same thing applies to the `YearsCodePro` column. I will also convert it to 'float'.

In [15]:
# Method 2 - replace

# Replace the text answers with numbers
df['YearsCodePro'] = df['YearsCodePro'].replace({'Less than 1 year': '0', 'More than 50 years': '50'})
df['YearsCodePro'] = df['YearsCodePro'].astype(float)

# Replace NaN values with the average, rounded to a whole year
avg_years = round(df['YearsCodePro'].mean(), 0)
df['YearsCodePro'] = df['YearsCodePro'].replace(np.nan, avg_years)

In [16]:
# Verify the imputation
df['YearsCodePro'].isnull().value_counts()

YearsCodePro
False    59745
Name: count, dtype: int64

### Imputing Missing Values in Categorical Columns

#### RemoteWork Column

In [17]:
# Method 1 - fillna

# Finding the most frequent value
most_remote = df['RemoteWork'].mode()[0]

# Fill the missing values and assign the result back to the column
df['RemoteWork'] = df['RemoteWork'].fillna(most_remote)

In [18]:
# Verify the imputation
df['RemoteWork'].isnull().value_counts()

RemoteWork
False    59745
Name: count, dtype: int64

#### EdLevel Column

In [19]:
# Method 2 - replace

# Finding the most frequent value
freq_ed_level = df['EdLevel'].mode()[0]

# Replace the NaN values with the most frequent value
df['EdLevel'] = df['EdLevel'].replace(np.nan, freq_ed_level)

In [20]:
# Verify the imputation
df['EdLevel'].isnull().value_counts()

EdLevel
False    59745
Name: count, dtype: int64
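
Mode imputation keeps every row but shifts the distribution toward the most frequent value, which is why it was avoided for the technology columns earlier. A small sketch with invented answers makes the effect visible:

```python
import numpy as np
import pandas as pd

remote = pd.Series(["Hybrid", "Hybrid", "Remote", "In-person", np.nan, np.nan])

# Share of the most frequent answer among the observed values
top_value = remote.mode()[0]
share_before = (remote.dropna() == top_value).mean()

# Share of the same answer after filling the gaps with it
filled = remote.fillna(top_value)
share_after = (filled == top_value).mean()

print(top_value, share_before, share_after)
```

In this toy case the top answer grows from half of the observed responses to two thirds of all responses.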

## Data Normalization

### Min-Max Scaling

(data - min) / (max - min)

#### YearsCode Column

In [21]:
# Creating a new column to hold the normalized values
df['YearsCode_MinMax'] = (df['YearsCode'] - df['YearsCode'].min()) / (df['YearsCode'].max() - df['YearsCode'].min())

In [22]:
# Compare the normalized and original values
df[['YearsCode_MinMax', 'YearsCode']].head()

Unnamed: 0,YearsCode_MinMax,YearsCode
1,0.4,20.0
2,0.74,37.0
3,0.08,4.0
4,0.18,9.0
5,0.2,10.0
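
The formula can be checked by hand on a toy series. The sketch below assumes a range of 0 to 50 years, which is consistent with the output above (20 maps to 0.4 and 37 maps to 0.74).

```python
import pandas as pd

years = pd.Series([0.0, 20.0, 37.0, 50.0])

# (value - min) / (max - min) rescales the series to the range [0, 1]
scaled = (years - years.min()) / (years.max() - years.min())
print(scaled.tolist())
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value compresses everything else toward one end of the scale.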


### Z-Score Normalization

(data - mean) / standard deviation

#### YearsCodePro Column

In [23]:
# Placing the normalized values in a new column
df['YearsCodePro_Zscore'] = (df['YearsCodePro'] - df['YearsCodePro'].mean()) / df['YearsCodePro'].std()

In [24]:
# Compare the normalized and original values
df[['YearsCodePro_Zscore', 'YearsCodePro']].head()

Unnamed: 0,YearsCodePro_Zscore,YearsCodePro
1,0.812509,17.0
2,2.010883,27.0
3,-0.026352,10.0
4,-0.026352,10.0
5,-0.026352,10.0
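
One detail worth knowing: `Series.std()` in pandas defaults to the sample standard deviation (`ddof=1`), whereas some other tools default to the population version (`ddof=0`). The toy sketch below compares the two on made-up numbers; which one is preferable here is a judgment call.

```python
import pandas as pd

years = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# pandas default: sample standard deviation (ddof=1)
z_sample = (years - years.mean()) / years.std()

# Population standard deviation (ddof=0)
z_population = (years - years.mean()) / years.std(ddof=0)

print(z_sample.round(3).tolist())
print(z_population.round(3).tolist())
```

With tens of thousands of rows the two versions differ only marginally.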


## Other Techniques

### Binning

Create a new column, `ExperienceLevel`, based on the `YearsCodePro` column.

In [25]:
# Create the ranges and labels
ranges = [0, 3, 5, 8, 10, 100]

# Store the names for each range
range_labels = ['Entry', 'Mid', 'Senior', 'Lead', 'Architect']

# Using the function cut to apply the bins
df['ExperienceLevel'] = pd.cut(df['YearsCodePro'], bins = ranges, labels = range_labels, include_lowest=True, ordered=False)

In [26]:
# Displaying 10 rows
df[['YearsCodePro', 'ExperienceLevel']].sample(n = 10, random_state = 42)

Unnamed: 0,YearsCodePro,ExperienceLevel
39276,2.0,Entry
2944,13.0,Architect
64994,19.0,Architect
39938,1.0,Entry
34270,9.0,Lead
14389,10.0,Lead
22694,5.0,Mid
30319,0.0,Entry
51141,20.0,Architect
27535,3.0,Entry
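
The sample shows 3.0 labeled Entry and 10.0 labeled Lead because `pd.cut` closes each bin on the right by default, and `include_lowest=True` additionally lets 0 fall into the first bin. A short sketch using the same edges and labels on a few hand-picked boundary values:

```python
import pandas as pd

ranges = [0, 3, 5, 8, 10, 100]
range_labels = ['Entry', 'Mid', 'Senior', 'Lead', 'Architect']

# Boundary values chosen to show where each bin edge lands
sample_years = pd.Series([0, 3, 3.5, 5, 10, 11])
levels = pd.cut(sample_years, bins=ranges, labels=range_labels, include_lowest=True)

print(levels.astype(str).tolist())
```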


### One-Hot Encoding

In [27]:
# Display the values and counts in the MainBranch column
df['MainBranch'].value_counts()

MainBranch
I am a developer by profession                                                           46236
I am not primarily a developer, but I write code sometimes as part of my work/studies     5956
I am learning to code                                                                     3163
I code primarily as a hobby                                                               3050
I used to be a developer by profession, but no longer am                                  1340
Name: count, dtype: int64

In [28]:
# Use get_dummies to encode the MainBranch column
# The resulting values are only True or False
# Rename the columns for readability; get_dummies sorts the category columns
# alphabetically, so the new names must follow that order
df_encoded = pd.get_dummies(df['MainBranch'])
df_encoded.columns = ['ProDeveloper', 'Learner', 'OccasionalCoder', 'HobbyCoder', 'FormerDev']
df_encoded.head()

Unnamed: 0,ProDeveloper,Learner,OccasionalCoder,HobbyCoder,FormerDev
1,True,False,False,False,False
2,True,False,False,False,False
3,False,True,False,False,False
4,True,False,False,False,False
5,False,False,False,True,False


In [29]:
# Adding the new encoded values to the dataframe
new_df = pd.concat([df, df_encoded], axis = 1)
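
Assigning a list to `columns` depends on the order in which `get_dummies` emits the categories. As a hedged alternative sketch, a mapping keyed by the original answer text does not depend on that order. The `short_names` dictionary below is hypothetical and covers only the three answers in this toy series.

```python
import pandas as pd

branch = pd.Series([
    "I am a developer by profession",
    "I code primarily as a hobby",
    "I am learning to code",
], name="MainBranch")

# Hypothetical mapping from the original answer text to a short column name
short_names = {
    "I am a developer by profession": "ProDeveloper",
    "I am learning to code": "Learner",
    "I code primarily as a hobby": "HobbyCoder",
}

# rename matches on the existing labels, so column order does not matter
encoded = pd.get_dummies(branch).rename(columns=short_names)
print(encoded.columns.tolist())
```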

## Exporting the Dataframe to a CSV 

In [30]:
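# Save df without the index column
# Note: the one-hot columns exist only in new_df and are not part of this export
df.to_csv("clean_survey_data.csv", index = False)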
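
CSV stores plain text, so a categorical column such as `ExperienceLevel` comes back as ordinary strings when the file is read again. The sketch below demonstrates this on a toy frame, with an in-memory buffer standing in for the file. The two-bin edges and labels are invented for brevity.

```python
import io
import pandas as pd

toy = pd.DataFrame({"YearsCodePro": [2.0, 9.0]})
toy["ExperienceLevel"] = pd.cut(toy["YearsCodePro"], bins=[0, 3, 10],
                                labels=["Entry", "Lead"], include_lowest=True)

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(toy["ExperienceLevel"].dtype, restored["ExperienceLevel"].dtype)
```

If the category type matters downstream, it has to be re-created after loading.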