# `pandas` Part 5: Finding and Replacing Values

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Check datatypes with `dtype`
2. Find and replace missing (null) values with `fillna()`
 

## Files Needed for this lesson: `winemag-data-130k-v2.csv`
>- Download this csv from Canvas prior to the lesson

## The general steps to working with pandas:
1. import pandas as pd
2. Create or load data into a pandas DataFrame or Series
3. Read data from a file with one of the `pd.read_*` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- Csv files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read into your notebook is not in the same folder you can do one of two things:
>>- Move the file you want to read into the same folder/directory as the notebook
>>- Type out the full path into the read function
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first records in your DataFrame (5 by default; pass a number such as `head(10)` to see more)
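
The general steps above can be sketched in a few lines. To keep the sketch self-contained, it reads a tiny in-memory CSV through `io.StringIO` instead of a file on disk; the `csv_text` contents and the `id` column name are made up for illustration and are not part of the wine dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (made-up values)
csv_text = """id,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,US,90,30.0
"""

# Steps 2-3: read the data, using the first column as the index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Step 4: check the size and peek at the first records
n_rows, n_cols = df.shape
print(n_rows, n_cols)
print(df.head())  # head() shows the first 5 rows by default
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file name or full path.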

# Analytics Project Framework Notes
## A complete and thorough analytics project will have 3 main areas
1. Descriptive Analytics: tells us what has happened or what is happening. 
>- The focus of this lesson is how to do this in Python.
>- Many companies are at this level but not much more than this
>- Descriptive statistics (mean, median, mode, frequencies)
>- Graphical analysis (bar charts, pie charts, histograms, box-plots, etc)
2. Predictive Analytics: tells us what is likely to happen next
>- Fewer companies are at this level but are slowly getting there
>- Predictive statistics ("machine learning (ML)" using regression, multi-way frequency analysis, etc)
>- Graphical analysis (scatter plots with regression lines, decision trees, etc)
3. Prescriptive Analytics: tells us what to do based on the analysis
>- Synthesis and Report writing: executive summaries, data-based decision making
>- No analysis is complete without a written report with at least an executive summary
>- Communicate results of analysis to both non-technical and technical audiences

# Descriptive Analytics Using `pandas`

# Initial set-up steps
1. import modules and check working directory
2. Read data in
3. Check the data

In [1]:
import os, pandas as pd
path = os.getcwd()
path

'C:\\Users\\anton\\Documents\\CU-Python\\week_12'

# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file name: `winemag-data-130k-v2.csv`
>- Set the index to column 0
>- Note: I'm using the full path name on my laptop because I have the file in a different folder than the ipynb for this lesson

In [12]:
wineReviews = pd.read_csv("C:\\Users\\anton\\Documents\\CU-Python\\week_10\\winemag-data-130k-v2.csv", 
index_col=0)

### Check how many rows, columns, and data points are in the `wineReviews` DataFrame
>- Use `shape` and indices to define variables
>- We can store the values for rows and columns in variables if we want to access them later

In [13]:
rows = wineReviews.shape[0]
columns = wineReviews.shape[1]

print(f''' The wine dataset has:
    {rows} rows and {columns} columns''')

 The wine dataset has:
    129971 rows and 13 columns


### Check a couple of rows of data

In [14]:
wineReviews.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos


### Another step in understanding the data you are working with is checking the data types
>- The analysis will differ depending on the data type
>>- For example, only number fields can be averaged
>>- Text/string analysis usually involves counts/frequencies 

### Checking datatypes with `dtype` and `dtypes`
>- General syntax for `dtype`: dataFrame.field.dtype
>>- Returns the datatype for one field
>- General syntax for `dtypes`: dataFrame.dtypes
>>- Returns the datatypes for all the fields in a dataframe

###  Check one field with `dtype`

In [18]:
wineReviews.country.dtype  # 'O' stands for object, the dtype this column uses to hold strings

dtype('O')

### Check all the fields in the data frame with `dtypes`

In [16]:
wineReviews.dtypes

country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

### Question: What is the average price of all wines? 

In [22]:
round(wineReviews.price.mean(),2)

35.36

### Question: How many wines are there per country in the data frame? 

In [26]:
wineReviews.country.value_counts()#Value counts

US                        54504
France                    22093
Italy                     19540
Spain                      6645
Portugal                   5691
Chile                      4472
Argentina                  3800
Austria                    3345
Australia                  2329
Germany                    2165
New Zealand                1419
South Africa               1401
Israel                      505
Greece                      466
Canada                      257
Hungary                     146
Bulgaria                    141
Romania                     120
Uruguay                     109
Turkey                       90
Slovenia                     87
Georgia                      86
England                      74
Croatia                      73
Mexico                       70
Moldova                      59
Brazil                       52
Lebanon                      35
Morocco                      28
Peru                         16
Ukraine                      14
Macedoni

##### Another way to get wines by country using `groupby`: 

In [30]:
wineReviews.groupby(['country']).country.count().sort_values(ascending=False)

country
US                        54504
France                    22093
Italy                     19540
Spain                      6645
Portugal                   5691
Chile                      4472
Argentina                  3800
Austria                    3345
Australia                  2329
Germany                    2165
New Zealand                1419
South Africa               1401
Israel                      505
Greece                      466
Canada                      257
Hungary                     146
Bulgaria                    141
Romania                     120
Uruguay                     109
Turkey                       90
Slovenia                     87
Georgia                      86
England                      74
Croatia                      73
Mexico                       70
Moldova                      59
Brazil                       52
Lebanon                      35
Morocco                      28
Peru                         16
Ukraine                      14


## What are the descriptive analytics for wine price?
>- Include the 10th and 90th percentiles of price in the analysis
>- Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [32]:
wineReviews.price.describe(percentiles = [.1, .9]) #can pass a list of percentiles, other info as well

count    120975.000000
mean         35.363389
std          41.022218
min           4.000000
10%          12.000000
50%          25.000000
90%          65.000000
max        3300.000000
Name: price, dtype: float64

## What are the descriptive analytics for country?  

In [34]:
wineReviews.country.describe()  # country is stored as object (text), so describe() reports count, unique, top, and freq instead of numeric statistics

count     129908
unique        43
top           US
freq       54504
Name: country, dtype: object

## What are the descriptive analytics for all numerical fields in the data frame? 
>- Note: By default, `describe()` summarizes only the numeric fields when called on a DataFrame. 
>- Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [36]:
wineReviews.describe()  # defaults to describing only the numeric (int and float) fields

Unnamed: 0,points,price
count,129971.0,120975.0
mean,88.447138,35.363389
std,3.03973,41.022218
min,80.0,4.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,42.0
max,100.0,3300.0


#### Question: Why would points and price have different count values? 

## What are the descriptive analytics for all non-numeric fields in the DataFrame? 
>- Note: we can use `select_dtypes` with the parameter `include='object'` to only include string fields.
>>- `select_dtypes(include='object')`
>- Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html#pandas.DataFrame.select_dtypes


In [38]:
wineReviews.select_dtypes(include='object').describe()

Unnamed: 0,country,description,designation,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
count,129908,129971,92506,129908,108724,50511,103727,98758,129971,129970,129971
unique,43,119955,37979,425,1229,17,19,15,118840,707,16757
top,US,"Seductively tart in lemon pith, cranberry and ...",Reserve,California,Napa Valley,Central Coast,Roger Voss,@vossroger,Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma...,Pinot Noir,Wines & Winemakers
freq,54504,3,2009,36247,4480,11065,25514,25514,11,13272,222


## Finally, to include every field in the data frame:
>- Use `describe(include='all')`
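
As a minimal sketch of what `include='all'` does, the example below runs it on a small made-up DataFrame (`mini` is a hypothetical stand-in, not the wine dataset). Numeric and text columns are summarized together, and any statistic that does not apply to a column comes back as `NaN`.

```python
import pandas as pd

# Hypothetical miniature of the wine data (made-up values)
mini = pd.DataFrame({
    "country": ["US", "US", "Italy", None],
    "points": [90, 88, 87, 91],
    "price": [30.0, None, 15.0, 42.0],
})

# include='all' summarizes numeric and non-numeric columns together;
# statistics that do not apply to a column are shown as NaN
summary = mini.describe(include="all")
print(summary)
```

On the lesson's data the equivalent call would be `wineReviews.describe(include='all')`.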

# Notice how the fields in `wineReviews` vary in count? 
>- A common occurrence in datasets is missing (aka null) values
>- We can use `pd.isnull()` to flag the null values for a particular field
>- We can use `pd.notnull()` to flag the non-missing values for a particular field
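
A short sketch of both functions on a made-up DataFrame (the `mini` frame and its values are hypothetical). Each call returns a boolean mask, and summing a mask counts its `True` values, which is one way to see why column counts differ.

```python
import pandas as pd

# Hypothetical miniature of the wine data (made-up values)
mini = pd.DataFrame({
    "country": ["US", None, "Italy"],
    "price": [30.0, None, None],
})

# Boolean mask: True where price is missing
missing_price = pd.isnull(mini.price)

# Summing a boolean mask counts the True values
print(missing_price.sum())

# Count the missing values in every column at once
null_counts = mini.isnull().sum()
print(null_counts)

# Boolean mask: True where price is present
present_price = pd.notnull(mini.price)
print(present_price)
```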

#### Q: What are all the wines with missing country values?
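
One way to answer this is to use the null mask as a row filter. The sketch below uses a made-up DataFrame (`mini` and its titles are hypothetical) so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical miniature of the wine data (made-up values)
mini = pd.DataFrame({
    "country": ["US", None, "Italy", None],
    "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
})

# Keep only the rows where country is null
missing_country = mini[mini.country.isnull()]
print(missing_country)
```

On the lesson's data the same pattern would be `wineReviews[wineReviews.country.isnull()]`. Since `describe()` reported 129908 non-null countries out of 129971 rows, that filter should return 63 rows.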

## Now, suppose we want to replace a missing value with `Unknown`
>- We can use a pandas function called `fillna()` and pass the value "Unknown" to it

#### Replace null values for `region_2` with 'Unknown'
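
A minimal sketch of this replacement, using a made-up DataFrame (`mini` and its values are hypothetical). Note that `fillna()` returns a new Series rather than changing the original, so the result is assigned back to the column.

```python
import pandas as pd

# Hypothetical miniature of the wine data (made-up values)
mini = pd.DataFrame({
    "region_1": ["Etna", None, "Napa Valley"],
    "region_2": [None, None, "Napa"],
})

# fillna() returns a new Series; assign it back to keep the change
mini["region_2"] = mini.region_2.fillna("Unknown")
print(mini)
```

Only `region_2` is filled here; the missing value in `region_1` is left untouched.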

### To store the non-null values in a DataFrame...
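
One reading of this step, sketched on a made-up DataFrame (`mini`, `has_region_2`, and `same_result` are hypothetical names): filter the rows with a `notnull()` mask and assign the result to a new variable. `dropna()` with the `subset` parameter is shown as an equivalent alternative.

```python
import pandas as pd

# Hypothetical miniature of the wine data (made-up values)
mini = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "region_2": [None, "Central Coast", "Napa"],
})

# Filter with a notnull() mask and assign the result to a new variable
has_region_2 = mini[mini.region_2.notnull()]

# Equivalent: drop the rows whose region_2 is null
same_result = mini.dropna(subset=["region_2"])
print(has_region_2)
```

The original `mini` DataFrame is unchanged; both variables hold filtered copies.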

# Using `replace()` to replace specific values
>- Suppose a taster in the dataset gets a new Twitter handle. We can use `replace()` to update this data

#### Task: Kerin O’Keefe is changing her Twitter handle from `@kerinokeefe` to `@kerino`
>- Use pandas `replace()` to make the change in our DataFrame
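
A minimal sketch of the task on a made-up DataFrame (`mini` is a hypothetical stand-in that reuses two taster names from the dataset). `replace()` matches whole cell values and returns a new Series, so the result is assigned back to the column.

```python
import pandas as pd

# Hypothetical miniature of the wine data
mini = pd.DataFrame({
    "taster_name": ["Kerin O’Keefe", "Roger Voss", "Kerin O’Keefe"],
    "taster_twitter_handle": ["@kerinokeefe", "@vossroger", "@kerinokeefe"],
})

# Swap the old handle for the new one wherever it appears in the column
mini["taster_twitter_handle"] = mini.taster_twitter_handle.replace(
    "@kerinokeefe", "@kerino"
)
print(mini)
```

On the lesson's data the same call would be applied to `wineReviews.taster_twitter_handle`.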