# PROJECT 1: CASE STUDY #

#### Downloading the dataset for the project ####

##### Using the helper library (opendatasets) to download stackoverflow-developer-survey-2020. First, install the library if it is not already installed. #####

In [1]:
%pip install opendatasets

Note: you may need to restart the kernel to use updated packages.


In [2]:
import opendatasets as od 
import os
import pandas as pd

In [3]:
od.download('stackoverflow-developer-survey-2020')

Using downloaded and verified file: .\stackoverflow-developer-survey-2020\survey_results_public.csv
Using downloaded and verified file: .\stackoverflow-developer-survey-2020\survey_results_schema.csv
Using downloaded and verified file: .\stackoverflow-developer-survey-2020\README.txt


In [4]:
os.listdir('stackoverflow-developer-survey-2020')

['README.txt', 'survey_results_public.csv', 'survey_results_schema.csv']

In [5]:
survey_raw_df = pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv')

In [6]:
survey_raw_df

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
0,1,I am a developer by profession,Yes,,13,Monthly,,,Germany,European Euro,...,Neither easy nor difficult,Appropriate in length,No,"Computer science, computer engineering, or sof...",ASP.NET Core,ASP.NET;ASP.NET Core,Just as welcome now as I felt last year,50.0,36,27
1,2,I am a developer by profession,No,,19,,,,United Kingdom,Pound sterling,...,,,,"Computer science, computer engineering, or sof...",,,Somewhat more welcome now than last year,,7,4
2,3,I code primarily as a hobby,Yes,,15,,,,Russian Federation,,...,Neither easy nor difficult,Appropriate in length,,,,,Somewhat more welcome now than last year,,4,
3,4,I am a developer by profession,Yes,25.0,18,,,,Albania,Albanian lek,...,,,No,"Computer science, computer engineering, or sof...",,,Somewhat less welcome now than last year,40.0,7,4
4,5,"I used to be a developer by profession, but no...",Yes,31.0,16,,,,United States,,...,Easy,Too short,No,"Computer science, computer engineering, or sof...",Django;Ruby on Rails,Ruby on Rails,Just as welcome now as I felt last year,,15,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57173,58300,I am a developer by profession,No,,8,,,,United States,United States dollar,...,,,,"A humanities discipline (such as literature, h...",,,Just as welcome now as I felt last year,,10,10
57174,58301,I code primarily as a hobby,Yes,27.0,21,,,,Pakistan,,...,Easy,Too long,No,"Computer science, computer engineering, or sof...",,Flask;Laravel,Somewhat more welcome now than last year,,5,
57175,58302,I am a developer by profession,Yes,,14,Yearly,59000.0,44622.0,Canada,Canadian dollar,...,Neither easy nor difficult,Appropriate in length,No,"Computer science, computer engineering, or sof...",Express,ASP.NET;ASP.NET Core;Express;Vue.js,Just as welcome now as I felt last year,40.0,7,2
57176,58303,I am a developer by profession,Yes,29.0,19,Monthly,,,Madagascar,Malagasy ariary,...,Easy,Too long,No,"Computer science, computer engineering, or sof...",Angular;Express;jQuery;React.js;Spring;Vue.js,Angular;Express;jQuery;React.js;Spring;Symfony...,Just as welcome now as I felt last year,40.0,10,5


In [7]:
survey_raw_df.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
       'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
       'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
       'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'MiscTechDesireNextYear', 'MiscTechWorkedWith',
       'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
       'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
       'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
       'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
       'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
       'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
       'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
       'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
  

In [8]:
schema_fname = 'stackoverflow-developer-survey-2020/survey_results_schema.csv'
schema_raw = pd.read_csv(schema_fname, index_col='Column').QuestionText

In [9]:
schema_raw

Column
Respondent            Randomized respondent ID number (not in order ...
MainBranch            Which of the following options best describes ...
Hobbyist                                        Do you code as a hobby?
Age                   What is your age (in years)? If you prefer not...
Age1stCode            At what age did you write your first line of c...
                                            ...                        
WebframeWorkedWith    Which web frameworks have you done extensive d...
WelcomeChange         Compared to last year, how welcome do you feel...
WorkWeekHrs           On average, how many hours per week do you wor...
YearsCode             Including any education, how many years have y...
YearsCodePro          NOT including education, how many years have y...
Name: QuestionText, Length: 61, dtype: object

In [10]:
schema_raw['YearsCodePro']

'NOT including education, how many years have you coded professionally (as a part of your work)?'

In [11]:
schema_raw['Age']

'What is your age (in years)? If you prefer not to answer, you may leave this question blank.'

## Data Preparation & Cleaning ## 

While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:

 - Demographics of the survey respondents and the global programming community
 - Distribution of programming skills, experience, and preferences
 - Employment-related information, preferences, and opinions

Let's select a subset of columns with the relevant data for our analysis.

In [12]:
selected_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]

In [13]:
len(selected_columns)

20

In [14]:
survey_df = survey_raw_df[selected_columns].copy()

In [15]:
schema = schema_raw[selected_columns]
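
Indexing with a list of column names raises a `KeyError` if any name is misspelled or absent, so it can help to check the list before subsetting. The snippet below is a minimal sketch on a toy frame; `toy_raw_df`, `wanted_columns`, and the misspelled `'Hobyist'` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-in for survey_raw_df (hypothetical data, not the real survey).
toy_raw_df = pd.DataFrame({
    'Country': ['Germany', 'India'],
    'Age': [25.0, 31.0],
    'Hobbyist': ['Yes', 'No'],
})

# Hypothetical selection that contains one misspelled column name.
wanted_columns = ['Country', 'Age', 'Hobyist']

# Names that were requested but do not exist in the frame.
missing = [col for col in wanted_columns if col not in toy_raw_df.columns]
print(missing)

# Keep the names that do exist; .copy() keeps later edits away from the raw frame.
present = [col for col in wanted_columns if col in toy_raw_df.columns]
toy_df = toy_raw_df[present].copy()
print(toy_df)
```

The same check could be run on `selected_columns` against `survey_raw_df.columns` before the subset is taken.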

In [16]:
survey_df

Unnamed: 0,Country,Age,Gender,EdLevel,UndergradMajor,Hobbyist,Age1stCode,YearsCode,YearsCodePro,LanguageWorkedWith,LanguageDesireNextYear,NEWLearn,NEWStuck,Employment,DevType,WorkWeekHrs,JobSat,JobFactors,NEWOvertime,NEWEdImpt
0,Germany,,Man,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Yes,13,36,27,C#;HTML/CSS;JavaScript,C#;HTML/CSS;JavaScript,Once a year,Visit Stack Overflow;Go for a walk or other ph...,"Independent contractor, freelancer, or self-em...","Developer, desktop or enterprise applications;...",50.0,Slightly satisfied,"Languages, frameworks, and other technologies ...",Often: 1-2 days per week or more,Fairly important
1,United Kingdom,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Computer science, computer engineering, or sof...",No,19,7,4,JavaScript;Swift,Python;Swift,Once a year,Visit Stack Overflow;Go for a walk or other ph...,Employed full-time,"Developer, full-stack;Developer, mobile",,Very dissatisfied,,,Fairly important
2,Russian Federation,,,,,Yes,15,4,,Objective-C;Python;Swift,Objective-C;Python;Swift,Once a decade,,,,,,,,
3,Albania,25.0,Man,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Yes,18,7,4,,,Once a year,,,,40.0,Slightly dissatisfied,Flex time or a flexible schedule;Office enviro...,Occasionally: 1-2 days per quarter but less th...,Not at all important/not necessary
4,United States,31.0,Man,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Computer science, computer engineering, or sof...",Yes,16,15,8,HTML/CSS;Ruby;SQL,Java;Ruby;Scala,Once a year,Call a coworker or friend;Visit Stack Overflow...,Employed full-time,,,,,,Very important
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57173,United States,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","A humanities discipline (such as literature, h...",No,8,10,10,,,,,Employed full-time,Data or business analyst;Senior executive/VP,,Very satisfied,,,Not at all important/not necessary
57174,Pakistan,27.0,Man,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Yes,21,5,,C;C#;HTML/CSS;Java;PHP;SQL,,Once every few years,Visit Stack Overflow,Employed full-time,"Developer, back-end",,,,,Very important
57175,Canada,,Man,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Computer science, computer engineering, or sof...",Yes,14,7,2,C;C++;Go;Java;JavaScript;Python;Rust;SQL,C;JavaScript;Python;Rust;SQL,Every few months,Meditate;Go for a walk or other physical activ...,Employed full-time,"Developer, back-end;Developer, front-end;Devel...",40.0,Slightly dissatisfied,Specific department or team I’d be working on;...,Occasionally: 1-2 days per quarter but less th...,Very important
57176,Madagascar,29.0,Man,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",Yes,19,10,5,HTML/CSS;Java;JavaScript;PHP;TypeScript,,Every few months,Meditate;Call a coworker or friend;Visit Stack...,Employed full-time,"Database administrator;Developer, back-end;Dev...",40.0,Slightly satisfied,Flex time or a flexible schedule;Remote work o...,Sometimes: 1-2 days per month but less than we...,Somewhat important


In [17]:
survey_df.shape

(57178, 20)

In [18]:
print('The dataset contains', survey_df.shape[0], 'rows')

The dataset contains 57178 rows


In [19]:
print('The dataset contains', survey_df.shape[1], 'columns')

The dataset contains 20 columns


In [20]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57178 entries, 0 to 57177
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 56928 non-null  object 
 1   Age                     42188 non-null  float64
 2   Gender                  46903 non-null  object 
 3   EdLevel                 52086 non-null  object 
 4   UndergradMajor          46378 non-null  object 
 5   Hobbyist                57141 non-null  object 
 6   Age1stCode              52606 non-null  object 
 7   YearsCode               52407 non-null  object 
 8   YearsCodePro            42334 non-null  object 
 9   LanguageWorkedWith      52084 non-null  object 
 10  LanguageDesireNextYear  49150 non-null  object 
 11  NEWLearn                50898 non-null  object 
 12  NEWStuck                50337 non-null  object 
 13  Employment              56726 non-null  object 
 14  DevType                 45061 non-null
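
The non-null counts reported by `info()` can also be expressed as a share of missing values per column, which makes the sparsest columns easier to spot. This is a minimal sketch on a toy frame (the values in `toy_df` are made up); the same expression should work on `survey_df`.

```python
import pandas as pd

# Toy frame with a few missing entries (hypothetical values).
toy_df = pd.DataFrame({
    'Country': ['Germany', None, 'India', 'Canada'],
    'Age': [25.0, None, None, 31.0],
    'Hobbyist': ['Yes', 'No', 'Yes', 'No'],
})

# isna() marks missing cells; the column mean of that mask is the missing fraction.
missing_pct = (toy_df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```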

In [21]:
survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')
survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors='coerce')
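
With `errors='coerce'`, any entry that cannot be parsed as a number becomes `NaN` instead of raising an error. That is consistent with the non-null count for `Age1stCode` falling from 52606 in `info()` to 52236 in `describe()`. The sketch below shows the behaviour on a toy Series; the text label in it is an assumed example of a non-numeric answer, not taken from the data shown here.

```python
import pandas as pd

# Toy column stored as strings; 'Less than 1 year' is an assumed non-numeric answer.
years_code = pd.Series(['36', '7', 'Less than 1 year', None, '15'])

# Unparseable entries become NaN rather than raising a ValueError.
numeric_years = pd.to_numeric(years_code, errors='coerce')
print(numeric_years)

# Entries that were present in the original but lost during conversion.
coerced = numeric_years.isna() & years_code.notna()
print(years_code[coerced].tolist())
```

Inspecting the coerced entries before overwriting the column is one way to confirm that only text labels, and not mistyped numbers, are being discarded.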

In [22]:
survey_df.describe()

                Age    Age1stCode     YearsCode  YearsCodePro   WorkWeekHrs
count       42188.0       52236.0       51625.0       40405.0       38029.0
mean      30.903942     15.447795      12.88122      8.907289     40.803372
std        9.599753      5.096677      9.505082      7.758968     17.763988
min             1.0           5.0           1.0           1.0           1.0
25%            24.0          12.0           6.0           3.0          40.0
50%            29.0          15.0          10.0           6.0          40.0
75%            35.0          18.0          17.0          12.0          44.0
max           279.0          85.0          50.0          50.0         475.0


In [23]:
# Discard rows with implausible ages (describe() shows a minimum of 1 and a maximum of 279)
survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
survey_df.drop(survey_df[survey_df.Age > 100].index, inplace=True)

In [24]:
# Discard rows reporting more than 140 working hours per week (about 20 hours per day)
survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)
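
The same cleaning can be written as a single boolean mask, which keeps all thresholds in one place. This is a sketch of an alternative on a toy frame, not the approach used in the cells above; `toy_df`, `implausible`, and `cleaned_df` are hypothetical names.

```python
import pandas as pd

# Toy frame with one missing age and three implausible rows (hypothetical values).
toy_df = pd.DataFrame({
    'Age': [25.0, 1.0, None, 279.0, 40.0],
    'WorkWeekHrs': [40.0, 35.0, 50.0, 45.0, 475.0],
})

# Comparisons against NaN evaluate to False, so rows with missing values are not flagged.
implausible = (toy_df.Age < 10) | (toy_df.Age > 100) | (toy_df.WorkWeekHrs > 140)

# Keep everything that was not flagged.
cleaned_df = toy_df[~implausible].copy()
print(cleaned_df)
```

As with the `drop` calls, rows where `Age` or `WorkWeekHrs` is missing are retained.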

In [25]:
survey_df['Gender'].value_counts()

Gender
Man                                                            42578
Woman                                                           3568
Non-binary, genderqueer, or gender non-conforming                356
Man;Non-binary, genderqueer, or gender non-conforming            110
Woman;Non-binary, genderqueer, or gender non-conforming           80
Woman;Man                                                         68
Woman;Man;Non-binary, genderqueer, or gender non-conforming       23
Name: count, dtype: int64

In [26]:
import numpy as np

In [27]:
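# Exclude respondents who picked more than one gender option; note that every column of such a row is set to NaN
survey_df.where(~(survey_df.Gender.str.contains(';', na=False)), np.nan, inplace=True)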
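
`DataFrame.where` keeps values where the condition is `True` and replaces the rest, and a row-wise condition applies to every column of that row. The sketch below shows this on a toy frame and contrasts it with blanking only the `Gender` column; the second variant is an assumption about what might be preferred, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame: the second respondent selected two options (hypothetical values).
toy_df = pd.DataFrame({
    'Country': ['Germany', 'India', 'Canada'],
    'Gender': ['Man', 'Woman;Man', None],
})

# True for rows whose Gender holds more than one option; missing values count as False.
multiple = toy_df.Gender.str.contains(';', na=False)

# Same pattern as the cell above: the whole flagged row becomes NaN.
whole_row = toy_df.where(~multiple, np.nan)
print(whole_row)

# Alternative: blank only the Gender value and keep the rest of the row.
gender_only = toy_df.copy()
gender_only.loc[multiple, 'Gender'] = np.nan
print(gender_only)
```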

In [28]:
survey_df.sample(10)

Unnamed: 0,Country,Age,Gender,EdLevel,UndergradMajor,Hobbyist,Age1stCode,YearsCode,YearsCodePro,LanguageWorkedWith,LanguageDesireNextYear,NEWLearn,NEWStuck,Employment,DevType,WorkWeekHrs,JobSat,JobFactors,NEWOvertime,NEWEdImpt
2447,United States,30.0,Man,Some college/university study without earning ...,,Yes,27.0,3.0,,HTML/CSS;Java;JavaScript;Python,Bash/Shell/PowerShell;Java;JavaScript;Kotlin;P...,Every few months,Visit Stack Overflow;Go for a walk or other ph...,"Independent contractor, freelancer, or self-em...","Developer, back-end;Developer, desktop or ente...",,,,,Not at all important/not necessary
39730,Mexico,,,,,No,,,,,,Once every few years,,Employed full-time,,,,,,
56427,France,24.0,Man,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Web development or web design,Yes,14.0,4.0,2.0,Bash/Shell/PowerShell;Go;Java;JavaScript;Kotli...,Go;JavaScript;Kotlin;PHP;SQL;TypeScript,Every few months,Call a coworker or friend;Visit Stack Overflow...,Employed full-time,"Developer, back-end;Developer, full-stack;Deve...",35.0,Very satisfied,"Flex time or a flexible schedule;Languages, fr...",Often: 1-2 days per week or more,Not at all important/not necessary
27694,India,29.0,Man,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Information systems, information technology, o...",Yes,18.0,12.0,6.0,Objective-C;Swift,Dart;Swift,Once every few years,Visit Stack Overflow;Watch help / tutorial videos,Employed full-time,"Developer, mobile",48.0,Very satisfied,Industry that I’d be working in;Flex time or a...,Occasionally: 1-2 days per quarter but less th...,Fairly important
8424,India,,,,,Yes,,,,Java,Dart;Kotlin,Once a year,Visit Stack Overflow;Go for a walk or other ph...,Employed full-time,,,,,,
40098,Canada,,,"Associate degree (A.A., A.S., etc.)","Information systems, information technology, o...",Yes,13.0,20.0,15.0,,,,,Employed full-time,"Developer, front-end;Developer, full-stack;Dev...",38.0,Slightly dissatisfied,"Languages, frameworks, and other technologies ...",Occasionally: 1-2 days per quarter but less th...,Somewhat important
27893,United States,17.0,Woman,"Secondary school (e.g. American high school, G...",,Yes,13.0,4.0,,C#;HTML/CSS;Java;JavaScript,Assembly;C;C++;Rust,Every few months,Play games;Visit Stack Overflow;Watch help / t...,"Not employed, but looking for work","Developer, back-end;Developer, game or graphics",,,,,Not at all important/not necessary
2252,United States,33.0,Man,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","A social science (such as anthropology, psycho...",Yes,23.0,10.0,7.0,C#;Go;HTML/CSS;JavaScript;Python;SQL,Go;JavaScript;Python;Scala;SQL,Once a year,Visit Stack Overflow;Go for a walk or other ph...,Employed full-time,Data scientist or machine learning specialist,45.0,Neither satisfied nor dissatisfied,Financial performance or funding status of the...,Sometimes: 1-2 days per month but less than we...,Somewhat important
26409,Germany,15.0,Man,"Secondary school (e.g. American high school, G...",,Yes,10.0,5.0,,Bash/Shell/PowerShell;C;C#;HTML/CSS;Java;JavaS...,Assembly;Bash/Shell/PowerShell;C;Haskell;HTML/...,Once a year,Watch help / tutorial videos;Do other work and...,Student,,,,,,
6086,India,25.0,Man,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)","Computer science, computer engineering, or sof...",Yes,19.0,5.0,4.0,C;C++;HTML/CSS;Java;Kotlin,JavaScript;Kotlin;Python;Swift,Once every few years,Call a coworker or friend;Visit Stack Overflow...,Employed full-time,"Developer, mobile;Developer, QA or test",43.0,Slightly satisfied,Flex time or a flexible schedule;Opportunities...,Sometimes: 1-2 days per month but less than we...,Very important


## Exploratory Analysis and Visualization ##

- Explore the age distribution of the survey respondents.
- Analyze the most commonly used programming languages.
- Investigate the relationship between age and years of coding experience.
- Identify any trends or patterns in employment and job satisfaction.

In [29]:
# Count respondents at each reported age (groupby skips rows with a missing Age)
respondents_by_age = survey_df.groupby('Age').size()
respondents_by_age

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(12,6))
respondents_by_age.plot(kind='line', color='skyblue', marker='o')
plt.title('Distribution of Respondents by Age')
plt.xlabel('Age')
plt.ylabel('Number of Respondents')

# Ages were restricted to 10-100 during cleaning, so tick every 10 years
plt.xticks(range(10, 101, 10))
plt.grid(True)
plt.show()
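
Columns such as `LanguageWorkedWith` hold several answers in one cell, separated by semicolons, so they need to be split before individual options can be counted. The snippet below is a minimal sketch on a toy Series in the same format; `languages` and `language_counts` are hypothetical names.

```python
import pandas as pd

# Toy answers in the same semicolon-separated format as LanguageWorkedWith.
languages = pd.Series([
    'C#;HTML/CSS;JavaScript',
    'JavaScript;Swift',
    None,
    'HTML/CSS;Ruby;SQL',
])

# Split each answer into a list, give every language its own row, then count.
language_counts = languages.dropna().str.split(';').explode().value_counts()
print(language_counts)
```

Applied to `survey_df.LanguageWorkedWith`, the same chain would give the number of respondents who reported working with each language.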