# Data cleaning 
Most of the data you'll work with has problems in the data itself: missing values such as NaN or None, or placeholder strings that stand in for a missing value.
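As a quick illustration of why this matters, the sketch below checks which kinds of "missing" pandas recognises on its own. The series `s` and its values are made up for this example; the point is that `np.nan` and `None` are detected, while a placeholder string such as `'NA'` is treated as ordinary text until it is converted.

```python
import numpy as np
import pandas as pd

# Illustrative series: one real value, two true missing values, one placeholder string
s = pd.Series(['Corey', np.nan, None, 'NA'])

# isna() flags np.nan and None, but not the string 'NA'
mask = s.isna()
print(mask.tolist())  # [False, True, True, False]
```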

In [None]:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'], 
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'], 
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com', None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing']
})

'''
1. Drop all rows that have any missing values. dropna() returns a new DataFrame, 
so reassign the result if you want to keep the change.

Default behavior: df.dropna(axis="index", how="any")
axis="index" means drop ROWS, and how="any" means drop a row if ANY of its values are missing (NaN).
'''
df.dropna()

'''
2. Maybe missing some values is alright, but if all values are missing then 
drop the rows.
'''
df.dropna(axis="index", how="all")


'''
3. Let's say we don't care about missing the first or last name, but
if it's missing the email address, then we need to drop the row. 
'''
df.dropna(axis='index', subset=['email'])

'''
4. We need the last name or the email. We just need one of them. With how='all' this 
  reads: "Drop the row if ALL of the values in subset are missing". So if both email and 
  last name are missing, drop the row. If at least one of them exists, keep the row.

NOTE: Remember that this returns a new data frame. To actually apply the drop, you 
have to pass inplace=True or reassign the result.
'''
df.dropna(axis='index', how='all', subset=['email', 'last'])

'''
5. Let's say people didn't know what to do and entered 'NA' or 'Missing' instead of leaving things blank.
To deal with this, replace those custom missing values with a proper NaN from numpy. The match is 
case-sensitive, so the strings must be spelled exactly as they appear in the data. Then 
you can proceed with data processing like normal.
'''
df.replace("NA", np.nan, inplace=True)
df.replace("Missing", np.nan, inplace=True)

'''
6. Gives us a mask or table of booleans, indicating whether something 
is NaN or not.
'''
df.isna()

'''
7. Let's say these were assignment scores and NaN meant a student never handed the assignment in. 
We want every NaN value replaced with the number 0 as the score for that assignment.

Replaces all NA values with the integer 0 (not the string '0').
'''
df.fillna(0, inplace=True)

'''
8. NaN values are actually represented as floats under the hood.
You can't convert NaN into an integer, so a column that contains NaN can't be cast to int. 
The solution is to store everything as float, so convert all of the values in the age series 
into floats.
'''
df['age'] = df['age'].astype(float)
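To compare the `dropna` variants from steps 1 to 4 side by side, the sketch below rebuilds the same frame, converts the placeholder strings first, and records which row labels survive each call. The names `cleaned` and `kept` are introduced here for illustration only, and nothing is modified in place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'],
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'],
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com',
              None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing']
})

# Convert the placeholder strings to NaN first (returns a new frame)
cleaned = df.replace(['NA', 'Missing'], np.nan)

# Row labels that survive each dropna variant
kept = {
    "any value missing": cleaned.dropna().index.tolist(),
    "all values missing": cleaned.dropna(how="all").index.tolist(),
    "email missing": cleaned.dropna(subset=["email"]).index.tolist(),
    "email and last both missing":
        cleaned.dropna(how="all", subset=["email", "last"]).index.tolist(),
}

for rule, rows in kept.items():
    print(f"drop if {rule}: rows kept = {rows}")
```

Row 3 (Chris) is a useful one to watch: it is dropped when the email alone is required, but kept when either the email or the last name is enough.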



In [8]:
'''
+ Analysis on Stack Overflow data
1. This matters more for the previous dataset, but if a file uses custom 'missing' values such as 'NA' and 'Missing', then by passing in this 
list as na_values, read_csv will convert any cell containing 'NA' or 'Missing' into a numpy NaN when the CSV loads.

2. Let's try to find the average number of years a person has been coding. Taking the mean of 'YearsCode' gives an error: "can only concatenate str (not 'int') to str".
This is telling us that the values in YearsCode are strings. Run 'df["YearsCode"].apply(lambda x: type(x))' to see that the 'numbers' are actually strings, 
while the NaN values are floats (which we talked about before). A simple trick is to convert all of these values into floats.
However, that fails too, because the series contains values such as 'Less than 1 year'. So first let's look at all of the unique values in the YearsCode column.

3. Replace 'Less than 1 year' with 0. Then replace 'More than 50 years' with 51 as a rough estimate. Now we should be able to convert 
everything to a float.

4. Now get the mean of the 'YearsCode' column.
'''


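import pandas as pd
csvPath = "../data/survey_results_public.csv"

# 1: strings that should be read as NaN when the CSV is loaded
na_values = ['NA','Missing']

# 2: load the survey and list the distinct values in YearsCode
df = pd.read_csv(csvPath, index_col="Respondent", na_values=na_values)
df["YearsCode"].unique()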


# 3
df["YearsCode"] = df["YearsCode"].replace({"Less than 1 year": 0, "More than 50 years": 51})
df["YearsCode"] = df["YearsCode"].astype(float)

# 4
df["YearsCode"].mean()



np.float64(11.662114216834588)
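The same four steps can be tried without the survey file. The sketch below is a minimal stand-in: the CSV text and its six rows are invented for illustration, and only the column names `Respondent` and `YearsCode` come from the cell above. It reads the text through `io.StringIO`, so `read_csv` behaves as it would on a file.

```python
import io
import pandas as pd

# Hypothetical miniature of the survey CSV (values invented for this example)
csv_text = """Respondent,YearsCode
1,4
2,Less than 1 year
3,Missing
4,More than 50 years
5,12
6,NA
"""

# 'Missing' becomes NaN because of na_values; 'NA' is also in pandas' default NaN list
df = pd.read_csv(io.StringIO(csv_text), index_col="Respondent",
                 na_values=["NA", "Missing"])
missing_count = int(df["YearsCode"].isna().sum())

# Map the two text answers to numbers, then cast the whole column to float
years = df["YearsCode"].replace({"Less than 1 year": 0, "More than 50 years": 51})
years = years.astype(float)

# mean() skips NaN by default, so only the four answered rows are averaged
mean_years = years.mean()
print(missing_count, mean_years)  # 2 16.75
```

Mapping 'More than 50 years' to 51 is only an estimate, so the mean is approximate whenever that answer appears.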