# How sparse is my data?

Most data sets contain missing values, often represented as NaN (Not a Number). If you are working with pandas, you can easily check how many missing values exist in each column.

Let's find out how many of the developers taking the survey chose to enter their age (found in the Age column of so_survey_df) and their gender (Gender column of so_survey_df).

In [1]:
# import relevant libraries
import pandas as pd
import numpy as np

# read the dataset as a csv file
so_survey_df = pd.read_csv('./Datasets/Combined_DS_v10.csv')

# check
so_survey_df.head()

Unnamed: 0,SurveyDate,FormalEducation,ConvertedSalary,Hobby,Country,StackOverflowJobsRecommend,VersionControl,Age,Years Experience,Gender,RawSalary
0,2/28/18 20:20,Bachelor's degree (BA. BS. B.Eng.. etc.),,Yes,South Africa,,Git,21,13,Male,
1,6/28/18 13:26,Bachelor's degree (BA. BS. B.Eng.. etc.),70841.0,Yes,Sweeden,7.0,Git;Subversion,38,9,Male,70841.00
2,6/6/18 3:37,Bachelor's degree (BA. BS. B.Eng.. etc.),,No,Sweeden,8.0,Git,45,11,,
3,5/9/18 1:06,Some college/university study without earning ...,21426.0,Yes,Sweeden,,Zip file back-ups,46,12,Male,21426.00
4,4/12/18 22:41,Bachelor's degree (BA. BS. B.Eng.. etc.),41671.0,Yes,UK,8.0,Git,39,7,Male,"£41,671.00"


In [2]:
# Subset the DataFrame to only include the 'Age' and 'Gender' columns.
age_gender = so_survey_df[['Age', 'Gender']]

# Print the number of non-missing values in both columns.
age_gender.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Age     999 non-null    int64 
 1   Gender  693 non-null    object
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


There are 693 non-missing entries in the Gender column, so 306 of the 999 rows have no gender recorded. The Age column has no missing values.
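If you only need the counts rather than the full `info()` summary, one common pattern is to combine `isnull()` with `sum()`. The sketch below uses a small made-up DataFrame (`toy_df` and its values are illustrative, not taken from the survey file).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the age_gender subset (illustrative values only)
toy_df = pd.DataFrame({
    'Age': [21, 38, 45, 46],
    'Gender': ['Male', 'Male', np.nan, 'Male'],
})

# isnull() flags each missing cell; summing the flags counts them per column
missing_counts = toy_df.isnull().sum()

# The mean of the flags gives the fraction of missing values per column
missing_share = toy_df.isnull().mean()

print(missing_counts)
print(missing_share)
```

The same two calls should work unchanged on `age_gender`.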

# Finding the missing values

While a summary of how much of your data is missing can be useful, you will often need to find the exact locations of the missing values. Using the same subset of the StackOverflow data from above (age_gender), you will see how a value can be flagged as missing.

In [3]:
# Print the first 10 entries of the DataFrame.
age_gender.head(10)

Unnamed: 0,Age,Gender
0,21,Male
1,38,Male
2,45,
3,46,Male
4,39,Male
5,39,Male
6,34,Male
7,24,Female
8,23,Male
9,36,


In [4]:
# Print the locations of the missing values in the first 10 rows.
age_gender.head(10).isnull()

Unnamed: 0,Age,Gender
0,False,False
1,False,False
2,False,True
3,False,False
4,False,False
5,False,False
6,False,False
7,False,False
8,False,False
9,False,True


In [5]:
# Print the locations of the non-missing values in the first 10 rows.
age_gender.head(10).notnull()

Unnamed: 0,Age,Gender
0,True,True
1,True,True
2,True,False
3,True,True
4,True,True
5,True,True
6,True,True
7,True,True
8,True,True
9,True,False


Knowing exactly where the missing values are is often important, for example when deciding which rows to drop or fill.
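The boolean output of `isnull()` can also be used as a mask to pull out the affected rows themselves. The following is a minimal sketch on a made-up DataFrame (`toy_df`, `gender_missing`, and the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy data with two missing Gender entries (illustrative values only)
toy_df = pd.DataFrame({
    'Age': [21, 38, 45, 46],
    'Gender': ['Male', np.nan, 'Female', np.nan],
})

# Boolean mask: True where Gender is missing
gender_missing = toy_df['Gender'].isnull()

# Use the mask to select the affected rows and collect their index labels
rows_missing_gender = toy_df[gender_missing]
missing_index = rows_missing_gender.index.tolist()

print(rows_missing_gender)
print(missing_index)
```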

# Listwise deletion

The simplest way to deal with missing values in your dataset, when they occur entirely at random, is to remove the rows that contain them. This is called 'listwise deletion'.

Depending on the use case, you will sometimes want to remove every row that contains a missing value, while at other times you may want to drop a particular column if too many of its values are missing.

In [6]:
# Print the number of rows and columns in so_survey_df.
so_survey_df.shape

(999, 11)

In [7]:
# Create a new DataFrame dropping all incomplete rows
no_missing_values_rows = so_survey_df.dropna()

# check
no_missing_values_rows.shape

(264, 11)

In [8]:
# Create a new DataFrame dropping all columns that contain missing values
no_missing_values_cols = so_survey_df.dropna(axis=1)

# check
no_missing_values_cols.shape

(999, 7)

In [9]:
# Drop all rows where Gender is missing
no_gender = so_survey_df.dropna(subset = ['Gender'])

# check
no_gender.shape

(693, 11)

As you can see, dropping all rows that contain any missing values can greatly reduce the size of your dataset (here from 999 rows to 264). Think carefully about this trade-off before deleting missing values.
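A middle ground between dropping every incomplete column and keeping all of them is to drop only the columns that are mostly empty. One way to sketch this is with the `thresh` argument of `dropna()`, which sets the minimum number of non-missing values a column needs in order to be kept. The DataFrame and the 50% cutoff below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: RawSalary is mostly missing (illustrative values only)
toy_df = pd.DataFrame({
    'Age': [21, 38, 45, 46],
    'Gender': ['Male', np.nan, 'Female', 'Male'],
    'RawSalary': [np.nan, np.nan, np.nan, '21426.00'],
})

# Require at least 50% non-missing values per column (an arbitrary cutoff)
min_non_missing = int(0.5 * len(toy_df))

# axis=1 drops columns; thresh keeps columns with enough non-missing values
mostly_complete = toy_df.dropna(axis=1, thresh=min_non_missing)

print(mostly_complete.columns.tolist())
```

Here `RawSalary` has only one non-missing value out of four, so it is the only column removed.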

# Replacing missing values with constants

While removing missing data entirely may be a correct approach in many situations, it can result in a lot of information being omitted from your models.

You may find categorical columns where the missing value is a valid piece of information in itself, such as someone refusing to answer a question in a survey. In these cases, you can fill all missing values with a new category entirely, for example 'No response given'.

In [10]:
# Print the count of occurrences of each category in so_survey_df's Gender column.
so_survey_df['Gender'].value_counts()

Male                                                                         632
Female                                                                        53
Female;Male                                                                    2
Transgender                                                                    2
Non-binary. genderqueer. or gender non-conforming                              1
Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
Female;Transgender                                                             1
Male;Non-binary. genderqueer. or gender non-conforming                         1
Name: Gender, dtype: int64

In [11]:
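# Replace all missing values in the Gender column with the string 'Not Given'.
# Assign the result back to the column so the original DataFrame is updated
# (calling fillna(..., inplace=True) on a selected column is not reliable in recent pandas).
so_survey_df['Gender'] = so_survey_df['Gender'].fillna('Not Given')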

# check
so_survey_df['Gender'].value_counts()

Male                                                                         632
Not Given                                                                    306
Female                                                                        53
Female;Male                                                                    2
Transgender                                                                    2
Non-binary. genderqueer. or gender non-conforming                              1
Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
Female;Transgender                                                             1
Male;Non-binary. genderqueer. or gender non-conforming                         1
Name: Gender, dtype: int64

By filling in these missing values, you can keep using the column in your analyses without losing the rows.

# Filling continuous missing values

In the last lesson, you dealt with different methods of removing missing values and of filling them in with a fixed string. These approaches are valid in many cases, particularly when dealing with categorical columns, but have limited use when working with continuous values. In these cases, it may be most valid to fill the missing values in the column with a value calculated from the entries present in the column.

In [12]:
# Print the first five rows of the StackOverflowJobsRecommend column of so_survey_df.
so_survey_df['StackOverflowJobsRecommend'].head()

0    NaN
1    7.0
2    8.0
3    NaN
4    8.0
Name: StackOverflowJobsRecommend, dtype: float64

In [13]:
# Replace the missing values in the StackOverflowJobsRecommend column with its mean,
# rounded to the nearest whole number so the filled values match the existing scores.
# Assign the result back to the column so the original DataFrame is updated.
so_survey_df['StackOverflowJobsRecommend'] = so_survey_df['StackOverflowJobsRecommend'].fillna(
    round(so_survey_df['StackOverflowJobsRecommend'].mean()))

# check
so_survey_df['StackOverflowJobsRecommend'].head()

0    7.0
1    7.0
2    8.0
3    7.0
4    8.0
Name: StackOverflowJobsRecommend, dtype: float64

Remember, you should only round your values if you are certain it is applicable.

When working with predictive models, you will often have separate train and test DataFrames. In these cases, you want to ensure that no information from your test set leaks into your train set. When filling missing values in data to be used in these situations, calculate the measure of central tendency (mean, median, etc.) on the train set only, and apply that same value to both the train and test sets.
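A minimal sketch of this idea follows. The two small DataFrames (`train_df`, `test_df`) and their values are made up for illustration; in practice they would come from your own train/test split.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test sets (illustrative values only)
train_df = pd.DataFrame({'StackOverflowJobsRecommend': [7.0, np.nan, 8.0, 9.0]})
test_df = pd.DataFrame({'StackOverflowJobsRecommend': [np.nan, 6.0]})

# Compute the fill value from the training data only
train_mean = train_df['StackOverflowJobsRecommend'].mean()

# Apply the same training-set value to both DataFrames
train_df['StackOverflowJobsRecommend'] = train_df['StackOverflowJobsRecommend'].fillna(train_mean)
test_df['StackOverflowJobsRecommend'] = test_df['StackOverflowJobsRecommend'].fillna(train_mean)

print(train_df)
print(test_df)
```

Note that the test set's own values never enter the calculation of `train_mean`.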

# Dealing with stray characters (I)

In this exercise, you will work with the RawSalary column of so_survey_df, which contains the wages of the respondents along with currency symbols and commas, such as $42,000. When importing data from Microsoft Excel, more often than not you will come across data in this form.

In [14]:
# Let's first check the data type of this column
so_survey_df['RawSalary'].dtype

dtype('O')

This output shows that the column has the object dtype, which is how pandas typically stores strings, while it should be a numeric column.

In [15]:
# Remove the commas (,) from the RawSalary column.
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace(',', '')

In [16]:
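# Remove the dollar ($) signs from the RawSalary column.
# regex=False treats '$' as a literal character rather than the regex end-of-string anchor.
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('$', '', regex=False)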

Replacing/removing specific characters is a very useful skill.
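When several characters need to go, an alternative to separate calls is a single regular-expression replacement with a character class. The sketch below runs on a made-up Series (`raw` and its values are illustrative); inside square brackets, `$` is treated as a literal character.

```python
import pandas as pd

# Illustrative salary strings with dollar signs and thousands separators
raw = pd.Series(['$42,000.00', '70841.00', '$1,000,000.00'])

# The character class [,$] matches either a comma or a dollar sign
cleaned = raw.str.replace(r'[,$]', '', regex=True)

print(cleaned)
```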

# Dealing with stray characters (II)

In the last exercise, you could tell quickly from the df.head() call which characters were causing an issue. In many cases this will not be so apparent. There will often be values deep within a column that prevent you from casting the column to a numeric type so that it can be used in a model or in further feature engineering.

One approach to finding these values is to force the column to the desired data type using pd.to_numeric(), coercing any values that cause issues to NaN, and then filtering the DataFrame down to just the rows containing NaN values.

If you try to cast the RawSalary column to float, it will fail because the column still contains an additional character. Find that character and remove it so the column can be cast as a float.

In [17]:
# Attempt to convert the RawSalary column of so_survey_df to numeric values
# coercing all failures into null values.
numeric_vals = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')

# check
numeric_vals

0            NaN
1        70841.0
2            NaN
3        21426.0
4            NaN
         ...    
994          NaN
995      58746.0
996      55000.0
997          NaN
998    1000000.0
Name: RawSalary, Length: 999, dtype: float64

In [18]:
# Find the indexes of the rows containing NaNs.
idx = numeric_vals.isnull()

# check
idx

0       True
1      False
2       True
3      False
4       True
       ...  
994     True
995    False
996    False
997     True
998    False
Name: RawSalary, Length: 999, dtype: bool

In [19]:
# Print the rows in RawSalary based on these indexes.
so_survey_df['RawSalary'][idx]

0            NaN
2            NaN
4      £41671.00
6            NaN
8            NaN
         ...    
989          NaN
990          NaN
992          NaN
994          NaN
997          NaN
Name: RawSalary, Length: 401, dtype: object

Notice the pound (£) sign in the RawSalary column. Remove these signs as you did in the previous exercise.

In [20]:
# Remove the (£) signs from the RawSalary column.
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('£', '')

# Convert the column to float
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].astype(float)

# check
so_survey_df['RawSalary'].head()

0        NaN
1    70841.0
2        NaN
3    21426.0
4    41671.0
Name: RawSalary, dtype: float64

Remember that even after removing all the relevant characters, you still need to change the type of the column to numeric if you want to plot these continuous values.

# Method chaining

When applying multiple operations on the same column (as in the previous exercises), you made the changes in several steps, assigning the result back at each step. However, when applying multiple successive operations on the same column, you can "chain" these operations together for clarity and ease of management. The tasks above can be achieved by calling the methods one after another:

```
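# Use method chaining (assumes RawSalary still holds the raw strings,
# i.e. this runs instead of the step-by-step cells above)
so_survey_df['RawSalary'] = (
    so_survey_df['RawSalary']
    .str.replace(',', '')
    .str.replace('$', '', regex=False)
    .str.replace('£', '')
    .astype(float)
)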
 
# Print the RawSalary column
print(so_survey_df['RawSalary'])
```

Custom functions can also be used in a method chain through the `.apply()` method.
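As a rough illustration, the sketch below chains `.apply()` with `.astype(float)`. The helper `strip_currency` and the sample Series `raw` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def strip_currency(value):
    """Hypothetical helper: remove commas and currency symbols from one entry."""
    # Leave missing entries untouched so they stay NaN after the cast
    if pd.isna(value):
        return value
    for char in (',', '$', '£'):
        value = value.replace(char, '')
    return value

# Illustrative salary strings, including one missing value
raw = pd.Series(['$42,000.00', np.nan, '£41,671.00'])

# apply() runs the custom function on every entry, then the result is cast to float
salaries = raw.apply(strip_currency).astype(float)

print(salaries)
```

For simple character removal the built-in `.str.replace()` calls are usually enough; `.apply()` is more useful when the cleaning logic does not fit a single string method.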