# StackOverflow Survey Data Analysis

## A Look at the Data

To get a better understanding of the data, we explore the following characteristics of the dataset:

1. Count the rows and columns in the dataset.
2. Provide a set of column names that have no missing values.
3. Identify the columns with the most missing values, and provide a set of column names that have more than 75% of their values missing.
4. Provide a pandas Series of the different **Professional** status values in the dataset, along with the number of individuals with each status.
5. Provide a pandas Series of the different **FormalEducation** values in the dataset, along with the number of individuals who received each level of formal education.
6. Provide a pandas Series of the different **Country** values in the dataset, along with the number of individuals from each country.
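
Items 4 to 6 all ask for the same kind of result: a Series of distinct values with their counts. A minimal sketch of that pattern with `Series.value_counts()`, using a small made-up frame (`toy` and its rows are hypothetical, not taken from the survey file):

```python
import pandas as pd

# Hypothetical stand-in for the survey data; values are invented for illustration.
toy = pd.DataFrame({
    "Professional": ["Student", "Professional developer", "Professional developer",
                     "Student", "Professional developer"],
    "Country": ["United States", "United Kingdom", "United Kingdom",
                "United States", "Switzerland"],
})

# value_counts() returns a Series indexed by the distinct values,
# sorted by count in descending order.
status_counts = toy["Professional"].value_counts()
country_counts = toy["Country"].value_counts()

print(status_counts)
print(country_counts)
```

On the real data, the same call would presumably be applied to `df["Professional"]`, `df["FormalEducation"]`, and `df["Country"]`.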

In [3]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


os.chdir("projects_on_GitHub/cases/SurveyDataAnalysis")

In [4]:
df = pd.read_csv('./survey_results_public.csv')
df.head()

Unnamed: 0,Respondent,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,...,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
0,1,Student,"Yes, both",United States,No,"Not employed, and not looking for work",Secondary school,,,,...,Strongly disagree,Male,High school,White or of European descent,Strongly disagree,Strongly agree,Disagree,Strongly agree,,
1,2,Student,"Yes, both",United Kingdom,"Yes, full-time",Employed part-time,Some college/university study without earning ...,Computer science or software engineering,"More than half, but not all, the time",20 to 99 employees,...,Strongly disagree,Male,A master's degree,White or of European descent,Somewhat agree,Somewhat agree,Disagree,Strongly agree,,37500.0
2,3,Professional developer,"Yes, both",United Kingdom,No,Employed full-time,Bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A professional degree,White or of European descent,Somewhat agree,Agree,Disagree,Agree,113750.0,
3,4,Professional non-developer who sometimes write...,"Yes, both",United States,No,Employed full-time,Doctoral degree,A non-computer-focused engineering discipline,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A doctoral degree,White or of European descent,Agree,Agree,Somewhat agree,Strongly agree,,
4,5,Professional developer,"Yes, I program as a hobby",Switzerland,No,Employed full-time,Master's degree,Computer science or software engineering,Never,10 to 19 employees,...,,,,,,,,,,


In [5]:
# 1. Number of rows and columns in the dataset
n_rows, n_cols = df.shape
print(f"There are {n_rows} rows and {n_cols} columns in the dataset.")

There are 19102 rows and 154 columns in the dataset.


In [9]:
# 2. Provide a set of column names that have no missing values.
no_nulls = set(df.columns[df.isnull().sum()==0])
no_nulls

{'Country',
 'EmploymentStatus',
 'FormalEducation',
 'Professional',
 'ProgramHobby',
 'Respondent',
 'University'}

In [12]:
# Find the column(s) with the most missing values
na_counts = df.isna().sum()  # missing-value count per column, computed once
na_counts[na_counts == na_counts.max()].index

Index(['ExCoderNotForMe', 'ExCoderWillNotCode'], dtype='object')
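
Beyond the single worst columns, it can help to rank every column by how many values it is missing. A minimal sketch on a made-up frame (`toy` and its columns `a` to `d` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "c" and "d" tie for the most missing values.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
    "d": [np.nan, np.nan, np.nan, 2.0],
})

# Count missing values per column once and reuse the result.
na_counts = toy.isna().sum()

# Columns tied for the maximum number of missing values.
most_missing = na_counts[na_counts == na_counts.max()].index

# Full ranking, most missing first.
ranked = na_counts.sort_values(ascending=False)

print(list(most_missing))
print(ranked)
```

Comparing against the maximum keeps ties, whereas `idxmax()` would return only the first matching column.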

In [14]:
# Sanity check: list the non-null count and dtype of every column
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19102 entries, 0 to 19101
Data columns (total 154 columns):
 #    Column                            Non-Null Count  Dtype  
---   ------                            --------------  -----  
 0    Respondent                        19102 non-null  int64  
 1    Professional                      19102 non-null  object 
 2    ProgramHobby                      19102 non-null  object 
 3    Country                           19102 non-null  object 
 4    University                        19102 non-null  object 
 5    EmploymentStatus                  19102 non-null  object 
 6    FormalEducation                   19102 non-null  object 
 7    MajorUndergrad                    15899 non-null  object 
 8    HomeRemote                        16471 non-null  object 
 9    CompanySize                       14653 non-null  object 
 10   CompanyType                       14609 non-null  object 
 11   YearsProgram                      19005 non-null  ob

In [16]:
# 3. Provide a set of column names that have more than 75% of their values missing.
# isna().mean() gives the fraction of missing values per column; use a strict ">" for "more than".
nulls_75plus = set(df.columns[df.isna().mean() > .75])
nulls_75plus

{'ExCoder10Years',
 'ExCoderActive',
 'ExCoderBalance',
 'ExCoderBelonged',
 'ExCoderNotForMe',
 'ExCoderReturn',
 'ExCoderSkills',
 'ExCoderWillNotCode',
 'ExpectedSalary',
 'MobileDeveloperType',
 'NonDeveloperType',
 'TimeAfterBootcamp',
 'WebDeveloperType',
 'YearsCodedJobPast'}
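
The task asks for columns with more than 75% missing, so the choice between a strict and an inclusive comparison matters whenever a column sits exactly on the threshold. A minimal sketch of the difference on a made-up frame (`toy` and its columns are hypothetical); with 19102 rows no column can be exactly 75% missing, so both comparisons give the same set on this dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 4 rows: "x" is exactly 75% missing, "y" is 100% missing.
toy = pd.DataFrame({
    "x": [np.nan, np.nan, np.nan, 1.0],
    "y": [np.nan, np.nan, np.nan, np.nan],
    "z": [1, 2, 3, 4],
})

# The mean of the boolean mask is the fraction of missing values per column.
missing_frac = toy.isna().mean()

strict = set(missing_frac[missing_frac > 0.75].index)      # "more than 75%"
inclusive = set(missing_frac[missing_frac >= 0.75].index)  # "at least 75%"

print(strict)
print(inclusive)
```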