# Cleaning Data in Python
* This course feels more about the thinking behind cleaning: where to clean, what to clean
* whereas the pandas course series focuses on learning the various functions

content  
* Exploring your data
  * Diagnose data for cleaning
  * Exploratory data analysis
  * Visual exploratory data analysis
* Tidying data for analysis ----------------- ☆
  * Tidy data
  * Pivoting data
  * Beyond melt and pivot
* Combining data for analysis
  * Concatenating data
  * Finding and concatenating data ---- ☆
  * Merge data
* Cleaning data for analysis
  * Data types
  * Using regular expressions to clean strings
  * Using functions to clean data ---------- ☆☆
  * Duplicate and missing data -------------- ☆
  * Testing with asserts
* Case study
  * Putting it all together
  * Initial impressions of the data
  * Final thoughts

## 1 exploring your data


### 1.1 diagnose data for cleaning

common data problems
* inconsistent column names 
* missing data 
* outliers
* duplicate rows
* untidy
* need to process columns 
* column types can signal unexpected data values

In [None]:
# think about column names and missing vals 
df.head()
df.tail()
df.columns   # an attribute, not a method
df.shape
df.info()
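
A minimal sketch of these checks on a small made-up DataFrame. The column names and values are invented for illustration; the last line shows one common way to normalize inconsistent column names, not necessarily the one used in the course.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: inconsistent column names
# (trailing space, mixed case) and one missing value.
df = pd.DataFrame({
    'country ': ['China', 'India', 'Peru'],
    'Continent': ['Asia', 'Asia', 'Americas'],
    'Population': [1.4e9, np.nan, 3.3e7],
})

print(df.columns)   # reveals the trailing space in 'country '
print(df.shape)     # (rows, columns)
df.info()           # non-null count and dtype per column

# One possible fix for the names: strip whitespace and lowercase.
df.columns = df.columns.str.strip().str.lower()
```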

### 1.2 exploratory data analysis
* to help identify data that needs further investigation
    * count the number of unique vals 

In [None]:
# non-numeric cols: frequency count to spot NaN
df.continent.value_counts(dropna=False)   # count also the missing vals

# numeric cols: summary stats to spot outliers
df.describe()  # only results of num cols returned
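
A small sketch of both checks on invented data (the values below are placeholders, not real figures), showing that `dropna=False` counts the missing value and that `describe()` reports only the numeric column.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    'continent': ['Asia', 'Asia', None, 'Europe'],
    'population': [1.4e9, 5.0e7, 3.3e7, 1.0e7],
})

# Frequency count that also reports the missing value.
counts = df['continent'].value_counts(dropna=False)
print(counts)

# Summary statistics: only the numeric column appears.
summary = df.describe()
print(summary)
```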

### 1.3 Visual exploratory data analysis
when there are too many cols, summary stats alone can be overwhelming  
use visual aids to:  
   * spot outliers and obvious errors
   * look for patterns

In [None]:
# bar plots and histograms
    # bar plots for discrete data counts
    # hist for continuous data counts
df.colA.plot(kind='hist')
df[df.population > 1000000000]   # select a subset to check whether the extreme vals are errors
    # is there any unexpected freq of vals?
    
# boxplots
    # outliers, percentiles
    # when you have a numeric column that you want to compare across different categories
df.boxplot(column='population', by='continent')
    
# scatter plots
    # relationship between 2 numeric vals
    # potentially bad data
        # eg. errors not found by looking at 1 val
        # select a subset to check whether the vals are errors

## 2 Tidying data for analysis
### 2.1 Tidy data  
* Principles of tidy data
    * Columns represent separate variables 
    * Rows represent individual observations 
    * Observational units form tables (not covered in this chapter)
* tidy data  
    * better for analysis
    * easier to fix common data problems  
    * able to transform data into the different shapes needed  
* problem when tidying data
    * columns containing values other than variables
    * solution: pd.melt()  
    
 (But why is tidy data a better format for analysis?)  
 What I can think of so far: it makes groupby easy and plotting easy (x=category_col, y=value, plus an aggregation func)  

In [None]:
# imagine a df with these columns:
#   ['name', 'treatmentA', 'treatmentB']
# where each row holds:
#   [some_name, result_of_treA, result_of_treB]

# 'treatmentA' and 'treatmentB' should be the values of a 'treatment' col
# pd.melt()
pd.melt(df, 
        id_vars='name',          # the col(s) to keep as-is (identifier cols)
        value_vars=['treatmentA', 'treatmentB'],   # the cols to melt
                                                   # by default all cols other than id_vars
        var_name='treatment',    # name of the new col holding the melted col names ('treatmentA', 'treatmentB')
        value_name='result'      # name of the new col holding the vals taken out of those cells
       )
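
A runnable sketch of the melt on invented data (names and results are made up). It also illustrates the groupby point from the note above: once the data is long, one `groupby` covers every treatment.

```python
import pandas as pd

# Toy wide-format data, invented for illustration.
df = pd.DataFrame({
    'name': ['Ann', 'Bob'],
    'treatmentA': [12, 24],
    'treatmentB': [42, 31],
})

# Wide -> long: the treatment column names become values of 'treatment'.
tidy = pd.melt(df,
               id_vars='name',
               value_vars=['treatmentA', 'treatmentB'],
               var_name='treatment',
               value_name='result')
print(tidy)

# With tidy data, a single groupby aggregates per treatment.
mean_by_treatment = tidy.groupby('treatment')['result'].mean()
print(mean_by_treatment)
```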

### 2.2 Pivoting data  
* melt and pivot go in opposite directions
    * melting: turn cols into rows
    * pivoting: turn unique vals into separate cols 
* uses of pivot
    * analysis-friendly shape to reporting-friendly shape
    * multiple vars stored in the same col ->> reshape into tidy
* my understanding  
    picture a spectrum: the far left <<- is the longest long format, the far right ->> is the widest wide format  
    the tidy shape and the data in hand both lie somewhere in between  
    to move toward long form: <<- pd.melt()  
    to move toward wide form: ->> df.pivot() / df.pivot_table()

In [None]:
# in df.pivot(), each index-columns pair must be unique
df.pivot(index=cols_you_dont_want_to_reform,
         columns=col_whose_unique_vals_each_get_a_col,
         values=col_whose_vals_fill_the_cells  # each index-columns pair corresponds to exactly one value
        )

# in df.pivot_table(), each index-columns pair can have multiple vals
# (the argument values below are placeholders, as above)
df.pivot_table(index=index_cols,
               columns=cols_to_spread,
               values=value_col,
               aggfunc='mean'   # aggregates the multiple vals of one index-columns pair
                                # 'mean' is the default
              )

# if there's a multi-level index after .pivot() / .pivot_table() and you don't want it
df.pivot(...).reset_index()
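
A sketch contrasting the two on invented data: `pivot` works while every index-columns pair is unique, raises `ValueError` once a pair is duplicated, and `pivot_table` resolves the duplicate by aggregating. All names and values here are made up.

```python
import pandas as pd

# Toy long-format data, invented for illustration.
long_df = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bob', 'Bob'],
    'treatment': ['A', 'B', 'A', 'B'],
    'result': [12, 42, 24, 31],
})

# Every (name, treatment) pair is unique, so pivot works.
wide = long_df.pivot(index='name', columns='treatment', values='result')

# Add a second measurement for (Ann, A): the pair is no longer unique.
extra = pd.DataFrame({'name': ['Ann'], 'treatment': ['A'], 'result': [14]})
dup_df = pd.concat([long_df, extra], ignore_index=True)

try:
    dup_df.pivot(index='name', columns='treatment', values='result')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicated pair (mean of 12 and 14).
wide_mean = dup_df.pivot_table(index='name', columns='treatment',
                               values='result', aggfunc='mean')

# Turn the 'name' index back into an ordinary column.
flat = wide_mean.reset_index()
print(flat)
```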

### 2.3 beyond melt and pivot
This section only covers one case

In [None]:
# using the **str attribute** of columns of type object.

tb_melt['gender'] = tb_melt.variable.str[0]  # create a new col 
                                             # pandas' vectorized string slicing  
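
A sketch of the same idea on an invented `variable` column whose codes pack two variables together (first character for gender, the rest for an age group). The codes and the `age_group` column name are assumptions for illustration.

```python
import pandas as pd

# Toy melted data, invented for illustration.
tb_melt = pd.DataFrame({
    'variable': ['m014', 'f1524', 'm65'],
    'value': [2, 5, 1],
})

# .str applies string indexing/slicing to every element of the column.
tb_melt['gender'] = tb_melt['variable'].str[0]       # first character
tb_melt['age_group'] = tb_melt['variable'].str[1:]   # everything after it
print(tb_melt)
```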

## 3 Combining data for analysis
### 3.1 concatenating data

In [None]:
# stack data vertically
pd.concat([df1, df2])  # keep original index in both df

pd.concat([df1, df2], 
          ignore_index=True  # discard original index, reindex with a RangeIndex
         )
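
A small sketch of why the index matters, on two invented frames: keeping the original indexes leaves duplicate labels, so a label lookup returns more than one row.

```python
import pandas as pd

# Two toy frames, invented for illustration.
df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'x': [3, 4]})

kept = pd.concat([df1, df2])                      # index: 0, 1, 0, 1
reset = pd.concat([df1, df2], ignore_index=True)  # index: 0, 1, 2, 3

print(kept.loc[0])    # two rows share label 0
print(reset.loc[0])   # exactly one row
```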

### 3.2 finding and concatenating data
Deals with the situation where you need to load many files  
and their names follow a pattern.

In [None]:
# globbing
    # wildcards: * ?
        # * matches zero or more of any char
        # ? matches exactly one of any char
import glob
import pandas as pd

csv_files = glob.glob('*.csv')

csv_files  # a list of all file names matching the pattern

lst_data = []

for file_name in csv_files:
    data = pd.read_csv(file_name)
    lst_data.append(data)       # ends up as a list of dataframes
    
pd.concat(lst_data)     # concat the list of dataframes into a single one
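
A self-contained sketch of the same loop that writes a few throwaway CSV files into a temporary directory first, so it can run anywhere. The file names (`part_0.csv` and so on) and their contents are invented; `sorted` is added because `glob.glob` does not guarantee an order.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write three small CSV files plus one unrelated file.
    for i in range(3):
        part = pd.DataFrame({'part': [i, i], 'value': [10 * i, 10 * i + 1]})
        part.to_csv(os.path.join(tmp_dir, f'part_{i}.csv'), index=False)
    with open(os.path.join(tmp_dir, 'notes.txt'), 'w') as fh:
        fh.write('not a csv')

    # '?' matches exactly one character, so notes.txt is not picked up.
    csv_files = sorted(glob.glob(os.path.join(tmp_dir, 'part_?.csv')))

    frames = [pd.read_csv(path) for path in csv_files]
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```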

In [11]:
# test
import glob 
pdf_files = glob.glob('*.pdf')
jnb_files = glob.glob('note_*')
print(pdf_files, '\n', jnb_files)

['ch1_slides.pdf', 'ch2_slides.pdf', 'ch3_slides.pdf', 'ch4_slides.pdf', 'ch5_slides.pdf'] 
 ['note_pd1_pandas_foundations_note.ipynb', 'note_pd2_merging_df_with_pandas.ipynb', 'note_pd3_manipulating_df_with_pandas.ipynb']


### 3.3 merge data
For details, see the Merging DataFrames with pandas course

In [None]:
pd.merge(left=state_populations, right=state_codes,  
          on=None, 
          left_on='state', right_on='name')  # by default an inner merge (how='inner')
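
A sketch on invented tables showing how the `how` argument changes the result: `pd.merge` defaults to `how='inner'`, which keeps only keys present on both sides, while `how='outer'` keeps every key. The state names are real, but the population numbers and the `ANSI` column are placeholders.

```python
import pandas as pd

# Toy tables, invented for illustration (populations are placeholders).
state_populations = pd.DataFrame({
    'state': ['California', 'Texas', 'Ohio'],
    'population': [100, 200, 300],
})
state_codes = pd.DataFrame({
    'name': ['California', 'Texas', 'Nevada'],
    'ANSI': ['CA', 'TX', 'NV'],
})

# Default: inner merge, only California and Texas match.
inner = pd.merge(left=state_populations, right=state_codes,
                 left_on='state', right_on='name')

# Outer merge keeps Ohio and Nevada too, with NaN where data is missing.
outer = pd.merge(left=state_populations, right=state_codes,
                 left_on='state', right_on='name', how='outer')

print(inner)
print(outer)
```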

## 4 cleaning data for analysis

### 4.1 data type

In [None]:
# in df['treatment b'], want object, get int
df['treatment b'].dtype   # get int64

df['treatment b'] = df['treatment b'].astype(str)   # strings are stored with object dtype
                                                    # now df['treatment b'] should be of type object

In [None]:
# cols with limited unique levels, category dtype can save memory & make other operations faster
df['sex'] = df['sex'].astype('category')

In [None]:
# want int, get string
df['treatment a'].values  # get (['-', '12', '24'], dtype=object)

# If you expect the data type of a column to be numeric (int or float), 
# but instead it is of type object, 
# this typically means that there is a non-numeric value in the column, 
# which also signifies bad data.

df['treatment a'] = pd.to_numeric(df['treatment a'],
                                  errors='coerce' # converting '-' to a number raises an error by default;
                                                  # 'coerce' tells pandas to use NaN instead of stopping
                                 )
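
A sketch of the conversion on the three values from the note above, showing both behaviours: the default raises, while `errors='coerce'` replaces the unparseable entry with NaN (which makes the result a float column).

```python
import pandas as pd

s = pd.Series(['-', '12', '24'])

# Default behaviour: the '-' cannot be parsed, so pandas raises.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

# With errors='coerce', unparseable entries become NaN.
num = pd.to_numeric(s, errors='coerce')
print(num)
print(num.dtype)
```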

### 4.2 Using regular expressions to clean strings  
pattern matching similar to globbing

In [None]:
# 12345678901 
r'\d*'   # decimal digits, zero or more times

# $12345678901
r'\$\d*'   # a literal dollar sign, then decimal digits zero or more times

# $12345678901.42 
r'\$\d*\.\d*'   # add a literal period .

# $12345678901.42 
r'\$\d*\.\d{2}'   # exactly 2 decimal digits after the period

# $12345678901.99
r'^\$\d*\.\d{2}$'  # ^ anchors the start of the string
                   # $ anchors the end of the string

In [None]:
# *: zero or more times
# +: one or more times

In [23]:
import re

pattern = re.compile(r'\$\d*\.\d{2}')   # raw string, so backslashes reach the regex engine unchanged

result = pattern.match('$17.89')

print(result, '\n',bool(result))
print('\n')

result2 = re.findall(r'\$\d*\.\d{2}',     # get a list of all matching strings
                     '$17.89 lalalla $12345678901.99 heyhey $1234')
print(result2)

<_sre.SRE_Match object; span=(0, 6), match='$17.89'> 
 True


['$17.89', '$12345678901.99']
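
A sketch of how the anchors and the choice between `*` and `+` change what counts as a valid amount. The sample strings are invented; using `\d+` to require at least one digit before the period is a variation on the patterns above, not something the course prescribes.

```python
import re

# \d+ requires at least one digit before the period;
# ^ and $ require the whole string to match.
strict = re.compile(r'^\$\d+\.\d{2}$')

# Same pattern with \d*, which also accepts zero digits before the period.
loose = re.compile(r'^\$\d*\.\d{2}$')

samples = ['$17.89', '$.99', '$12345678901.42', '17.89', '$17.8', '$17.890']

valid_strict = [s for s in samples if strict.match(s)]
valid_loose = [s for s in samples if loose.match(s)]

print(valid_strict)
print(valid_loose)
```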


### 4.3 using functions to clean data

In [None]:
# define a function and use it with .apply()

# example code:
import numpy as np

# Define recode_sex()
def recode_sex(sex_value):

    # Return 1 if sex_value is 'Male'
    if sex_value == 'Male':
        return 1
    
    # Return 0 if sex_value is 'Female'    
    elif sex_value == 'Female':
        return 0
    
    # Return np.nan for anything else
    else:
        return np.nan

# Apply the function to every value in the sex column
tips['sex_recode'] = tips.sex.apply(recode_sex)
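
A sketch that runs the same recoding on an invented Series and, for comparison, does it with `Series.map` and a dict. The `map` variant is an alternative I am adding, not part of the course example; values missing from the dict become NaN, which matches the `else` branch here.

```python
import numpy as np
import pandas as pd


def recode_sex(sex_value):
    """Return 1 for 'Male', 0 for 'Female', NaN for anything else."""
    if sex_value == 'Male':
        return 1
    elif sex_value == 'Female':
        return 0
    return np.nan


# Toy data, invented for illustration.
sex = pd.Series(['Male', 'Female', 'unknown'])

via_apply = sex.apply(recode_sex)            # calls recode_sex once per value
via_map = sex.map({'Male': 1, 'Female': 0})  # dict lookup; unmatched -> NaN

print(via_apply)
print(via_map)
```

`.apply()` with a function is the more general tool, since the function can contain any logic (for example a regex check); the dict form only covers plain value-to-value recoding.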


### 4.4 duplicate and missing data

In [None]:
# duplicate data
df.drop_duplicates()  # drop rows that are exact duplicates

In [None]:
# missing data NaN
    # leave as-is
    # drop them
    # fill missing vals
    
    # drop them
df.info()  # only counts non-null vals 
df.dropna()  # drop any row containing NaN

    # fill missing vals (fillna returns a new object; assign it back to keep the result)
tips_nan['sex'].fillna('missing')            # fill with a 'missing' indicator
tips_nan[['total_bill', 'size']].fillna(0)   # fill with 0

tips_nan['tip'].fillna(tips_nan['tip'].mean())  # fill with mean of the series
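
A sketch of the three options on a small invented frame. It assigns the results back, because `drop_duplicates`, `dropna` and `fillna` all return new objects and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: rows 0 and 1 are exact duplicates,
# row 2 has missing values.
tips_nan = pd.DataFrame({
    'sex': ['Male', 'Male', None, 'Female'],
    'tip': [2.0, 2.0, np.nan, 4.0],
})

deduped = tips_nan.drop_duplicates()   # keeps the first of the duplicate rows
dropped = tips_nan.dropna()            # removes the row containing NaN

filled = tips_nan.copy()
filled['sex'] = filled['sex'].fillna('missing')             # indicator value
filled['tip'] = filled['tip'].fillna(filled['tip'].mean())  # mean of the non-missing tips

print(filled)
```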

### 4.5 testing with asserts  
* programmatically check whether something is wrong
* if we drop or fill NaNs, we expect 0 missing vals  
write an assert to verify this  
  
(I don't quite see the point of assert, though: without assert, don't we still get a True | False return value?)

In [24]:
# if the assertions evals True, nothing will happen
assert 1 == 1

In [25]:
# if the assertion evals False, AssertionError
assert 1 == 2

AssertionError: 
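
One way to look at the question above, as a sketch on invented data: a bare boolean expression is only computed (a notebook shows it if it is the last line of a cell, a script discards it), so later code keeps running on bad data. An `assert` stops execution at that point, and can carry a message. The message text below is my own wording.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})
cleaned = df.dropna()

# A bare expression: evaluated, then ignored. Nothing stops if it is False.
df.notnull().all().all()

# Passes silently: no missing values are left after dropna().
assert cleaned.notnull().all().all()

# Fails loudly on the uncleaned frame; caught here only to show the message.
try:
    assert df.notnull().all().all(), 'df still has missing values'
    caught = None
except AssertionError as err:
    caught = str(err)

print(caught)
```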