# Intermediate Data Visualization with Seaborn

## Seaborn histplot
- The **histplot** is similar to the histogram.
- By default, generates a histogram but can also generate other complex plots.
## Seaborn displot
- The **displot** leverages the **histplot** and other functions for distribution plots.
- By default, it generates a histogram but can also generate other plot types.
## Creating a histogram
- The **displot** function has multiple optional arguments.
- You can overlay a KDE plot on the histogram and specify the number of bins to use.
## Alternative data distributions 
- A rug plot is an alternative way to view the distribution of data: it draws a small tick mark along the x-axis for each observation.
- A KDE curve and a rug plot can be combined.
## Further plot types
- The displot function uses several functions including **kdeplot**, **rugplot** and **ecdfplot**.
- The **ecdfplot** shows the cumulative distribution of the data.
## Introduction to regplot
- The **regplot** function generates a scatter plot with a regression line.
- Usage is similar to the **displot**.
- The **data** and **x** and **y** variables must be defined.
## lmplot faceting
- Plotting multiple graphs while changing a single variable
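The difference between **regplot** and **lmplot** is easiest to see in what each call returns. The sketch below uses a small synthetic DataFrame (the `wine` name and its values are made up for illustration, not the course data) and a non-interactive matplotlib backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the wine data: two wine types and a noisy linear trend.
rng = np.random.default_rng(0)
wine = pd.DataFrame({
    "quality": rng.integers(3, 9, size=200),
    "type": rng.choice(["red", "white"], size=200),
})
wine["alcohol"] = 8 + 0.5 * wine["quality"] + rng.normal(0, 0.5, size=200)

# regplot is axes-level: it draws on a single matplotlib Axes and returns it.
ax = sns.regplot(data=wine, x="quality", y="alcohol")

# lmplot is figure-level: it returns a FacetGrid and can split the data
# into one panel per value of `col`.
g = sns.lmplot(data=wine, x="quality", y="alcohol", col="type")

print(type(g).__name__)
print(g.axes.shape)  # one row, one column per wine type
```

Because `lmplot` owns the whole figure, faceting arguments such as `col`, `row` and `hue` are available there, while `regplot` can be dropped into an existing Axes.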


In [None]:
# Creating a histogram

import seaborn as sns
sns.displot(df['alcohol'], kde=True, bins=10)

# Alternative data distributions

sns.displot(df['alcohol'], kind='kde', rug=True, fill=True)

# Further plot types

sns.displot(df['alcohol'], kind='ecdf')

# Introduction to regplot

sns.regplot(data=df, x="alcohol", y="pH")

# lmplot() builds on top of the base regplot()
'''
- regplot (low level)
sns.regplot(data=df, 
            x="alcohol", 
            y="quality")
- lmplot (high level)
sns.lmplot(data=df, 
           x="alcohol", 
           y="quality")
They look similar, but lmplot is the more flexible of the two: it is a figure-level function that returns a FacetGrid and supports faceting with hue, col and row.
'''
# lmplot faceting
'''
- Organize data by colors (hue)
sns.lmplot(data=df, 
           x="quality", 
           y="alcohol", 
           hue="type")
- Organize data by columns (col)
sns.lmplot(data=df, 
           x="quality", 
           y="alcohol", 
           col="type")
'''


## Setting Styles
- Seaborn has default configurations that can be applied with **sns.set()**.
- These styles can override matplotlib and pandas plots as well.
## Removing axes with despine()
- Sometimes plots are improved by removing elements.
- Seaborn contains a shortcut for removing the spines of a plot.
## Colors in Seaborn
- Seaborn supports assigning colors to plots using **matplotlib** color codes.
- Seaborn uses the **set_palette()** function to define a palette.
## Displaying Palettes
- **sns.palplot()** displays a palette as a row of color swatches.
- **sns.color_palette()** returns the current palette, or a named palette when a name is passed.
## Defining Custom Palettes
Circular colors = when the data is not ordered
- **sns.palplot(sns.color_palette("Paired", 12))**
Sequential colors = when the data has a consistent range from high to low
- **sns.palplot(sns.color_palette("Blues", 12))**
Diverging colors = when both the low and high values are interesting
- **sns.palplot(sns.color_palette("BrBG", 12))**
## Customizing with matplotlib
- Most customization available through **matplotlib Axes** objects.
- **Axes** can be passed to seaborn functions.
- It is possible to combine and configure multiple plots.

In [1]:
# Using seaborn styles
import matplotlib.pyplot as plt  # needed for the plt.show() and plt.subplots() calls in this cell

sns.set()
df['Tuition'].plot.hist()

for style in ['white','dark','whitegrid','darkgrid','ticks']:
    sns.set_style(style)
    sns.displot(df['Tuition'])
    plt.show()

# Removing axes with despine()

sns.set_style('white')
sns.displot(df['Tuition'])
sns.despine(left=True)
# By default, despine() removes the top and right spines; left=True also removes the left one

# Colors in Seaborn

sns.set(color_codes=True)
sns.displot(df['Tuition'], color='g')

palettes = ['deep', 'muted', 'pastel', 'bright', 'dark','colorblind']
for p in palettes:
    sns.set_palette(p) 
    sns.displot(df['Tuition'])

# Displaying Palettes

palettes = ['deep', 'muted', 'pastel', 'bright','dark','colorblind']
for p in palettes: 
    sns.set_palette(p) 
    sns.palplot(sns.color_palette()) 
    plt.show()

# Combining Plots

fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, 
                               sharey=True, figsize=(7,4))

sns.histplot(df['Tuition'], stat='density', ax=ax0)
sns.histplot(df.query('State == "MN"')['Tuition'], stat='density', ax=ax1)

ax1.set(xlabel='Tuition (MN)', xlim=(0, 70000))
ax1.axvline(x=20000, label='My Budget', linestyle='--')
ax1.legend()


## Categorical Data
Data which takes on a limited and fixed number of values, normally combined with numeric data.
Examples include:
- Geography (country, state, region)
- Gender
- Ethnicity
- Blood type
- Eye color
## Plot types
- **Show each observation**: the first category includes **stripplot()** and **swarmplot()**, which show every individual observation on the plot.
- **Abstract representations**: the second category contains **boxplot()**, **violinplot()** and **boxenplot()**, which show an abstract representation of the categorical data.
- **Statistical estimates**: the third category has **barplot()**, **pointplot()** and **countplot()**, which show statistical estimates of the categorical variables.

In [None]:
# Plots of each observation

# Stripplot

sns.stripplot(data=df, y="DRG Definition", 
              x="Average Covered Charges", 
              jitter=True)
# Swarmplot

sns.swarmplot(data=df, 
             y="DRG Definition", 
             x="Average Covered Charges")
# Boxplot

sns.boxplot(data=df, 
            y="DRG Definition", 
            x="Average Covered Charges")
# Violinplot

sns.violinplot(data=df, 
               y="DRG Definition", 
               x="Average Covered Charges")
# Boxenplot

sns.boxenplot(data=df, 
              y="DRG Definition", 
              x="Average Covered Charges")
# Barplot

sns.barplot(data=df, 
            y="DRG Definition", 
            x="Average Covered Charges", 
            hue="Region")
# Pointplot

sns.pointplot(data=df, 
              y="DRG Definition", 
              x="Average Covered Charges", 
              hue="Region")
# Countplot

sns.countplot(data=df, 
              y="DRG_Code", 
              hue="Region")


## Evaluating regression with residplot()
- A residual plot is useful for evaluating the fit of a model.
- Ideally, the residuals should be scattered randomly around the horizontal zero line, with no visible pattern.
## Polynomial regression
- Seaborn supports polynomial regression using the **order** parameter.
## Estimators
- In some cases, an **x_estimator** can be useful for highlighting trends
## Getting data in the right format
- Seaborn's **heatmap()** function requires data to be in a grid format.
- Pandas **crosstab()** is frequently used to manipulate the data into that shape.
## Pairwise relationships
- **PairGrid** shows the pairwise relationships between the variables in a dataset.
- The PairGrid supports defining the type of plots that can be displayed on the diagonals.
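For the grid format that **heatmap()** expects, a small made-up example may help: `pd.crosstab` turns one-row-per-observation data into a rows-by-columns table of aggregated values. The `rentals` DataFrame below is invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Made-up long-format rentals: one row per observation.
rentals = pd.DataFrame({
    "mnth":          [1, 1, 1, 2, 2, 2],
    "weekday":       [0, 0, 1, 0, 1, 1],
    "total_rentals": [10, 20, 30, 40, 50, 70],
})

# crosstab pivots to a grid: rows = month, columns = weekday, cells = mean rentals.
grid = pd.crosstab(rentals["mnth"], rentals["weekday"],
                   values=rentals["total_rentals"], aggfunc="mean")
print(grid)

# The grid can be passed straight to heatmap(); ".0f" prints the float means without decimals.
ax = sns.heatmap(grid, annot=True, fmt=".0f", cmap="YlGnBu")
```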


In [None]:
# Evaluating regression with residplot()

sns.residplot(data=df, x='temp', y='total_rentals')

# Polynomial regression

sns.regplot(data=df, x='temp', y='total_rentals',
            order=2)

# residplot with polynomial regression

sns.residplot(data=df, x='temp', y='total_rentals', 
              order=2)

# Categorical values

sns.regplot(data=df, x='mnth', y='total_rentals', 
            x_jitter=.1, 
            order=2)

# Estimators

sns.regplot(data=df, x='mnth', y='total_rentals', 
            x_estimator=np.mean, 
            order=2)

# Getting data in the right format

pd.crosstab(df["mnth"], df["weekday"],
            values=df["total_rentals"],aggfunc="mean").round(0)

# Build a heatmap

sns.heatmap(pd.crosstab(df["mnth"], df["weekday"],          
                        values=df["total_rentals"], aggfunc="mean"))

# Customize a heatmap

df_crosstab = pd.crosstab(df["mnth"], df["weekday"],
                          values=df["total_rentals"], aggfunc="mean")
sns.heatmap(df_crosstab, annot=True, fmt=".0f",
            cmap="YlGnBu", cbar=False, linewidths=.5, center=df_crosstab.loc[9, 6])
# 'annot=True' turns on annotations in the individual cells.
# 'fmt=".0f"' displays the (float) means without decimals; fmt="d" only works on integer data.
# 'cmap' switches the shading to a yellow-green-blue palette.
# 'cbar=False' hides the color bar.
# 'linewidths' puts a small gap between the cells so that the values are easier to read.
# 'center' sets the data value that sits at the middle of the colormap (here the month 9, weekday 6 cell), which shifts the overall shading.

# Creating a PairGrid

g = sns.PairGrid(df, vars=["Fair_Mrkt_Rent","Median_Income"])
g = g.map(sns.scatterplot)
# we do not define row and column parameters; instead we list the variables to compare

# Customizing the PairGrid diagonals

g = sns.PairGrid(df, vars=["Fair_Mrkt_Rent", "Median_Income"])
g = g.map_diag(sns.histplot)
g = g.map_offdiag(sns.scatterplot)
# 'map_diag()' defines the plotting function for the main diagonal.
# 'map_offdiag()' defines the plotting function for the off-diagonal cells.

# pairplot() (is a short cut for the PairGrid)

sns.pairplot(df, vars=["Fair_Mrkt_Rent","Median_Income"], 
             kind="reg", diag_kind="hist")

# Basic JointGrid

g = sns.JointGrid(data=df, x="Tuition", y="ADM_RATE_ALL") 
g.plot(sns.regplot, sns.histplot)

# Advanced JointGrid

g = sns.JointGrid(data=df, x="Tuition", y="ADM_RATE_ALL") 
g = g.plot_joint(sns.kdeplot)
g = g.plot_marginals(sns.kdeplot, fill=True)  # 'fill' replaces the deprecated 'shade' argument

# jointplot() (is a short cut for the JointGrid)

sns.jointplot(data=df, x="Tuition", y="ADM_RATE_ALL", kind='hex')


# Exploratory Data Analysis

In [None]:
# Data summarization

# Aggregating ungrouped data (numeric columns only; 'mean' and 'std' fail on text columns)
books.select_dtypes('number').agg(['mean', 'std'])

# Specifying aggregations for columns
books.agg({'rating': ['mean', 'std'], 'year':['median']})

# Named summary columns
books.groupby('genre').agg(mean_rating=('rating', 'mean'), 
                           std_rating=('rating', 'std'), 
                           median_year=('year', 'median'))

## Data Cleaning and Imputation
### Strategies for addressing missing data
Drop missing values
- 5% or less of total values
Impute mean, median or mode
- Depends on distribution and context
Impute by sub-groups
- e.g. different experience levels have different median salary
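To make the difference between a global imputation and a sub-group imputation concrete, here is a minimal sketch on made-up data (the `df` values are invented; only the column names mirror the salaries example).

```python
import numpy as np
import pandas as pd

# Made-up salaries; two values are missing.
df = pd.DataFrame({
    "Experience": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
    "Salary_USD": [50000.0, 60000.0, np.nan, 120000.0, np.nan, 140000.0],
})

# Strategy 1: fill with the overall median, ignoring the group structure.
overall = df["Salary_USD"].fillna(df["Salary_USD"].median())

# Strategy 2: fill with the median of each row's own experience level.
# transform() returns a Series aligned to the original rows.
group_median = df.groupby("Experience")["Salary_USD"].transform("median")
by_group = df["Salary_USD"].fillna(group_median)

print(overall.tolist())
print(by_group.tolist())
```

With the overall median, both gaps receive the same value (90000.0), which is too high for an entry-level row and too low for a senior one; the group-wise version fills 55000.0 and 130000.0 respectively.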

In [None]:
# Checking for missing values
print(salaries.isna().sum())
'''
output:
Working_Year            12
Designation             27
Experience              33
Employment_Status       31
Employee_Location       28
Company_Size            40
Remote_Working_Ratio    24
Salary_USD              60
'''
# Dropping missing values
threshold = len(salaries) * 0.05
'''
output: 30
'''
# In this case, we can use Boolean indexing to filter 
# for columns with missing values less than or equal 
# to this threshold, storing them as a variable.
cols_to_drop = salaries.columns[salaries.isna().sum() <= threshold]
'''
output: Index(['Working_Year', 'Designation', 'Employee_Location', 'Remote_Working_Ratio'], dtype='object')
'''
# now, we need to use the dropna():
salaries.dropna(subset=cols_to_drop, inplace=True)
# after that we filter the remaining columns with missing values


# Checking the remaining columns with missing values
cols_with_missing_values = salaries.columns[salaries.isna().sum() > 0]
'''
output: Index(['Experience', 'Employment_Status', 'Company_Size', 'Salary_USD'], dtype='object')
'''
# Imputing a summary statistic (fillna returns a new Series, so assign it back)
for col in cols_with_missing_values[:-1]:
    salaries[col] = salaries[col].fillna(salaries[col].mode()[0])


# Checking the remaining missing values
print(salaries.isna().sum())
'''
Working_Year             0
Designation              0
Experience               0
Employment_Status        0
Employee_Location        0
Company_Size             0
Remote_Working_Ratio     0
Salary_USD              41
'''
# Calculating the median salary for each group
salaries_dict = salaries.groupby('Experience')['Salary_USD'].median().to_dict()
'''output: {'Entry': 55380.0, 'Executive': 135439.0, 'Mid': 74173.5, 'Senior': 128903.0}'''

# Imputing by sub-groups
salaries['Salary_USD'] = salaries['Salary_USD'].fillna(salaries['Experience'].map(salaries_dict))
# map() looks up each row's Experience value in salaries_dict, giving the group median used to fill that row.
print(salaries.isna().sum())
'''
output:
Working_Year            0
Designation             0
Experience              0
Employment_Status       0
Employee_Location       0
Company_Size            0
Remote_Working_Ratio    0
Salary_USD              0
'''

## Converting and analyzing categorical data

In [None]:
# Previewing the data
print(salaries.select_dtypes('object').head())
print(salaries['Designation'].nunique())
# nunique() counts how many distinct job titles there are

# Extracting value from categories
'''
- Current format limits our ability to generate insights
- pandas.Series.str.contains()
1. Search a column for a specific string or multiple strings
'''
salaries['Designation'].str.contains('Scientist')

# Finding multiple phrases in strings
salaries['Designation'].str.contains('Machine Learning|AI')
'''
we need to include a pipe ('|') between our two phrases; notice that we avoid
spaces before or after the pipe, because if we include them str.contains will only
match values that contain that space
'''

# Creating the categorical column
job_categories = ["Data Science", "Data Analytics", "Data Engineering", 
                  "Machine Learning","Managerial", "Consultant"]

data_science = "Data Scientist|NLP"
data_analyst = "Analyst|Analytics"
data_engineer = "Data Engineer|ETL|Architect|Infrastructure"
ml_engineer = "Machine Learning|ML|Big Data|AI"
manager = "Manager|Head|Director|Lead|Principal|Staff"
consultant = "Consultant|Freelance"

search_strings = [data_science, data_analyst, data_engineer, ml_engineer, manager, consultant]
conditions = []
for search_string in search_strings:
    conditions.append(salaries["Designation"].str.contains(search_string))
# now we can create the categorical column
salaries["Job_Category"] = np.select(conditions,
                                     job_categories, 
                                     default="Other")


## Working with numeric data
- Remove comma values in Salary_In_Rupees
- Convert the column to float data type
- Create a new column by converting the currency

In [None]:
# The original dataset
print(salaries["Salary_In_Rupees"].head())
'''
output:
0    20,688,070.00
1     8,674,985.00
2     1,591,390.00
3    11,935,425.00
4     5,729,004.00
Name: Salary_In_Rupees, dtype: object
'''
# Converting strings to numbers
# pd.Series.str.replace('characters to remove', 'characters to replace them with')
salaries['Salary_In_Rupees'] = salaries['Salary_In_Rupees'].str.replace(',', '')
print(salaries['Salary_In_Rupees'].head())
'''
output:
0    20688070.00
1     8674985.00
2     1591390.00
3    11935425.00
4     5729004.00
Name: Salary_In_Rupees, dtype: object
'''
# Converting the cleaned strings to float
salaries['Salary_In_Rupees'] = salaries['Salary_In_Rupees'].astype(float)
# 1 Indian Rupee = 0.012 US Dollars
salaries['Salary_USD'] = salaries['Salary_In_Rupees'] * 0.012
print(salaries['Salary_USD'].head())
'''
output:
0        248256.840
1        104099.820
2         19096.680
3        143225.100
4         68748.048
'''
# Adding summary statistics into a DataFrame
'''
Group by  --> Select   --> Call     -->  Apply lambda
Experience    Salary_USD   transform()    function
'''
salaries['std_dev'] = salaries.groupby('Experience')['Salary_USD'] \
                              .transform(lambda x: x.std())
# we use a backslash ('\') to split our code over two lines

## Handling outliers
### Using the interquartile range (IQR)
- IQR = 75th - 25th percentile
- Upper Outliers > 75th percentile + (1.5 * IQR)
- Lower Outliers < 25th percentile - (1.5 * IQR)
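The IQR rule above can be sketched on a tiny made-up Series (the values are invented, with 100 planted as an obvious outlier).

```python
import pandas as pd

values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])  # 100 is a planted outlier

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # lower fence
upper = q3 + 1.5 * iqr  # upper fence

# A value is an outlier if it falls outside EITHER fence ...
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# ... so the rows to keep are those inside BOTH fences.
kept = values[(values >= lower) & (values <= upper)]

print(outliers.tolist())
print(len(kept))
```

Note the switch from `|` when selecting outliers to `&` when keeping the rest: combining the two "inside the fence" conditions with `|` would keep every row.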

In [None]:
# Step by step

# First, use descriptive statistics to check whether there are outliers
print(salaries['Salary_USD'].describe())

# Visualizing them in a boxplot
sns.boxplot(data=salaries, x='Salary_USD')
plt.show()

# Identifying thresholds
seventy_fifth = salaries['Salary_USD'].quantile(0.75)
twenty_fifth = salaries['Salary_USD'].quantile(0.25)
salaries_iqr = seventy_fifth - twenty_fifth

# Identifying outliers
upper = seventy_fifth + (1.5 * salaries_iqr)
lower = twenty_fifth - (1.5 * salaries_iqr)

# Subsetting our data
salaries[(salaries["Salary_USD"] < lower) | (salaries["Salary_USD"] > upper)] \
        [["Experience", "Employee_Location", "Salary_USD"]]

# Dropping outliers
no_outliers = salaries[(salaries["Salary_USD"] > lower) & (salaries["Salary_USD"] < upper)]
# we keep the rows that are above the lower limit AND below the upper limit (using '|' here would keep every row)


## Relationships in Data

In [None]:
# Importing DateTime data
divorce = pd.read_csv('divorce.csv', parse_dates=['marriage_date'])
# The parse_dates argument takes a list of column names that should be interpreted as DateTime data.

# Converting to DateTime data (after reading the CSV file)
divorce['marriage_date'] = pd.to_datetime(divorce['marriage_date'])

# Creating DateTime data (if the information is spread across several columns)
divorce['marriage_date'] = pd.to_datetime(divorce[['month', 'day', 'year']])

# Extracting parts of a full date (dt.month / dt.day / dt.year)
divorce["marriage_month"] = divorce["marriage_date"].dt.month


In [None]:
# Correlation heatmaps
sns.heatmap(divorce.corr(numeric_only=True), annot=True)
'''
- A heatmap has the benefit of color coding, so 
that strong positive and negative correlations are easy to spot.
- The Pearson coefficient "df.corr()" only 
describes the linear correlation between variables.
'''
# Pairplots
sns.pairplot(data=divorce, vars=['income_man', 'income_woman', 'marriage_duration'])
'''
- It's important to complement our correlation calculations with scatter plots, because they can reveal non-linear relationships.
- Pairplot plots all pairwise relationships between numerical variables in one visualization.
- But we can limit the number of plotted relationships by setting the 'vars' argument equal to the variables of interest.
'''
# Kernel Density Estimate (KDE) plots
sns.kdeplot(data=divorce, x='marriage_duration', hue='education_man', 
            cut=0, cumulative=True)
'''
- Similar to histograms, KDEs allow us to visualize distributions.
- KDEs are considered more interpretable, though, especially when multiple distributions are shown.
- To improve our plot, we can pass the argument 'cut'; cut tells Seaborn how far past the minimum and maximum data values the curve should go when smoothing is applied.
- If we're interested in the cumulative distribution function, we can set the cumulative keyword argument to True.
'''

## Turning Exploratory Analysis into Action
### Class imbalance
Ex: In a sample of a thousand people, 50 were married, 700 were divorced, 250 were single.
- This is an example of class imbalance, where one class occurs more frequently than others.
- This can bias results, particularly if this class does not occur more frequently in the population.
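A quick way to check for class imbalance is to compare relative frequencies. The sketch below rebuilds the sample described above as a Series (the `status` name is an assumption for illustration).

```python
import pandas as pd

# Rebuild the sample described above: 50 married, 700 divorced, 250 single.
status = pd.Series(["married"] * 50 + ["divorced"] * 700 + ["single"] * 250,
                   name="marital_status")

counts = status.value_counts()                # absolute frequencies
shares = status.value_counts(normalize=True)  # relative frequencies, summing to 1

print(counts)
print(shares)
```

Here "divorced" accounts for 70% of the rows, which is the kind of skew worth comparing against what is known about the population.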

In [None]:
# Cross-tabulation (examines combinations of classes; with values and aggfunc it shows the median price per combination instead of the frequency)
pd.crosstab(planes['Source'], planes['Destination'], 
            values=planes['Price'], aggfunc='median')

# Creating categories (We can group numeric data and label them as classes)

# Descriptive statistics
twenty_fifth = planes["Price"].quantile(0.25)
median = planes["Price"].median()
seventy_fifth = planes["Price"].quantile(0.75)
maximum = planes["Price"].max()

# Labels and bins
labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, twenty_fifth, median, seventy_fifth, maximum]

# Price categories
planes["Price_Category"] = pd.cut(planes["Price"], 
                                  labels=labels, 
                                  bins=bins)


# Working with Categorical Data in Python

## Differences between Categorical and Numerical data
### Categorical
- Finite number of groups (or categories).
- These categories are usually fixed or known (eye color, hair color, etc.).
- Known as qualitative data.

### Numerical
- Known as quantitative data.
- Expressed using a numerical value.
- Is usually a measurement (height, weight, IQ, etc.).



## The two types of categorical data
### Ordinal
- Categorical variables that have a natural order:
- ex: Strongly Disagree / Disagree / Neutral / Agree / Strongly Agree
### Nominal
- Categorical variables that cannot be placed into a natural order:
- ex: Blue / Green / Red / Yellow / Purple
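In pandas, the ordinal/nominal distinction corresponds to the `ordered` flag of a categorical dtype. The following is a minimal sketch using the two examples above; the `answers` and `colors` Series are made up.

```python
import pandas as pd

levels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
answers = pd.Series(["Agree", "Neutral", "Strongly Agree", "Disagree"])

# Ordinal: declare the order explicitly so comparisons and min/max are meaningful.
ordinal = answers.astype(pd.CategoricalDtype(categories=levels, ordered=True))
print(ordinal.min(), "|", ordinal.max())
print((ordinal > "Neutral").tolist())

# Nominal: no order is declared, so order-based operations are rejected.
colors = pd.Series(["Blue", "Green", "Red"], dtype="category")
try:
    colors.min()
except TypeError as err:
    print("TypeError:", err)
```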

In [None]:
# Setting the type of an object column as a categorical column

# Default dtype
adult["Marital Status"].dtype
'''
output: dtype('O')
'''
# Set as categorical:
adult["Marital Status"] = adult["Marital Status"].astype("category")
adult["Marital Status"].dtype
'''
output: CategoricalDtype(categories=[' Divorced', ' Married-AF-spouse',' Married-civ-spouse', ' Married-spouse-absent', ' Never-married',' Separated', ' Widowed'], ordered=False)
'''
# Creating a categorical Series
my_data = ["A", "A", "C", "B", "C", "A"]

my_series1 = pd.Series(my_data, dtype="category")
print(my_series1)

# Another way to create a categorical Series (pd.Categorical lets us set the category order)
my_data = ["A", "A", "C", "B", "C", "A"]

my_series2 = pd.Series(pd.Categorical(my_data, categories=["C", "B", "A"], ordered=True))
print(my_series2)

In [None]:
# Why do we use categorical: memory saver
adult = pd.read_csv("data/adult.csv")
adult["Marital Status"].nbytes
'''
output: 260488
'''
adult["Marital Status"] = adult["Marital Status"].astype("category")
adult["Marital Status"].nbytes
'''
output: 32617
'''

In [None]:
# Specifying columns

# Option 1: only runs .sum() on two columns (note the double brackets for a list of columns)
adult.groupby(by=["Above/Below 50k"])[['Age', 'Education Num']].sum()

# Option 2: runs .sum() on all numeric columns and then subsets
adult.groupby(by=["Above/Below 50k"]).sum(numeric_only=True)[['Age', 'Education Num']]

## Categorical pandas Series
### Series.cat.method_name
Common parameters:
- new_categories: a list of categories
- inplace: Boolean (whether or not the update should overwrite the Series); this argument was removed in pandas 2.0, so assign the result back instead
- ordered: Boolean (whether or not the categorical is treated as an ordered categorical)
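A minimal sketch of three of the `Series.cat` methods on a made-up `coat` Series shows how each one treats values that fall outside the category list.

```python
import pandas as pd

coat = pd.Series(["short", "long", "wirehaired", "medium", "short"], dtype="category")

# set_categories replaces the category list; values not listed become NaN.
coat_set = coat.cat.set_categories(["short", "medium", "long"], ordered=True)
print(coat_set.isna().sum())  # "wirehaired" is no longer a valid category

# add_categories extends the list without touching existing values.
coat_added = coat.cat.add_categories(["hairless"])
print(list(coat_added.cat.categories))

# remove_categories drops a category; rows holding it become NaN.
coat_removed = coat.cat.remove_categories(["wirehaired"])
print(coat_removed.isna().sum())
```

Each method returns a new Series, so the result has to be assigned back to the column to take effect.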

In [None]:
# Setting Series categories
dogs['coat'] = dogs['coat'].cat.set_categories(
    new_categories=['short', 'medium', 'long'],
    ordered=True
)
# Adding categories
dogs["likes_people"] = dogs["likes_people"].astype("category")
dogs["likes_people"] = dogs["likes_people"].cat.add_categories(
    new_categories=["did not check", "could not tell"]
)
# Check categories
dogs["likes_people"].cat.categories
'''
output: Index(['no', 'yes', 'did not check', 'could not tell'], dtype='object')
'''
# Removing categories
dogs["coat"] = dogs["coat"].astype("category")
dogs["coat"] = dogs["coat"].cat.remove_categories(removals=["wirehaired"])

## Updating Categories

In [None]:
# Renaming categories
'''
Series.cat.rename_categories(new_categories=dict)
'''
# Make a dictionary:
my_changes = {"Unknown Mix": "Unknown"}

# Rename the category:
dogs["breed"] = dogs["breed"].cat.rename_categories(my_changes)

# Renaming categories with a function
dogs['sex'] = dogs['sex'].cat.rename_categories(lambda c: c.title())
dogs['sex'].cat.categories
'''
output: Index(['Female', 'Male'], dtype='object')
'''
# Collapsing categories setup
update_colors = {
    "black and brown": "black",
    "black and tan": "black",
    "black and white": "black",
}
dogs["main_color"] = dogs["color"].replace(update_colors)
# the problem with this method is that the data can lose its categorical dtype, so we convert it back to categorical
dogs['main_color'] = dogs['main_color'].astype('category')

## Reordering Categories

In [None]:
# Reordering example
dogs["coat"] = dogs["coat"].cat.reorder_categories(
    new_categories=['short', 'medium', 'wirehaired', 'long'],
    ordered=True
)
# the result is assigned back because the inplace argument was removed in pandas 2.0

## Cleaning and accessing data
### Possible issues with categorical data
1) Inconsistent values: "Ham", "ham", " Ham"
2) Misspelled values: "Ham", "Hma"
3) Wrong dtype: df['Our Column'].dtype


In [None]:
# Identifying issues
'''
Series.cat.categories
Series.value_counts()
'''
# Fixing issues: whitespace
dogs["get_along_cats"] = dogs["get_along_cats"].str.strip()

# Fixing issues: capitalization (.title(), .upper(), .lower())
dogs["get_along_cats"] = dogs["get_along_cats"].str.title()

# Fixing issues: misspelled words
replace_map = {"Noo": "No"}
dogs["get_along_cats"] = dogs["get_along_cats"].replace(replace_map)

# Using the str accessor object
dogs["breed"].str.contains("Shepherd", regex=False)

# Accessing data with loc
dogs.loc[dogs["get_along_cats"] == "Yes", "size"]

## Pitfalls and Encoding
### Using categories can be frustrating
- Using the **.str** accessor object to manipulate data converts the Series to an object.
- The **.apply()** method outputs a new Series as an object.
- The common methods of adding, removing, replacing, or setting categories do not all handle missing categories the same way.
- NumPy functions generally do not work with categorical Series.
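Two of these pitfalls can be reproduced in a few lines. The sketch below uses made-up `color` and `photos` Series.

```python
import pandas as pd

color = pd.Series(["red", "blue", "red", "green"], dtype="category")

# .str methods work on categoricals, but the result is no longer categorical.
upper = color.str.upper()
print(color.dtype, "->", upper.dtype)

# Convert back explicitly if the categorical dtype is still wanted.
upper_cat = upper.astype("category")

# Numeric-looking categories are labels, not numbers: cast before doing arithmetic.
photos = pd.Series([3, 5, 3, 8], dtype="category")
total = photos.astype(int).sum()
print(total)
```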


In [None]:
# Huge memory savings
used_cars['manufacturer_name'].describe()
'''
output:
count          38531
unique            55
top       Volkswagen
freq            4243
Name: manufacturer_name, dtype: object
'''
print("As object: ", used_cars['manufacturer_name'].nbytes)
print("As category: ", used_cars['manufacturer_name'].astype('category').nbytes)
'''
output:
As object: 308248
As category: 38971
'''
# Little memory savings 
used_cars['odometer_value'].astype('object').describe()
'''
output:
count      38531
unique      6063
top       300000
freq        1794
Name: odometer_value, dtype: int64
'''
print(f"As float: {used_cars['odometer_value'].nbytes}")
print(f"As category: {used_cars['odometer_value'].astype('category').nbytes}")
'''
output:
As float: 308248
As category: 125566
'''

In [None]:
# Using NumPy arrays
# Don't do this
used_cars['number_of_photos'] = used_cars['number_of_photos'].astype("category")
used_cars['number_of_photos'].sum()  # <--- Gives an Error
'''
output:
TypeError: Categorical cannot perform the operation sum
'''
# Do this
used_cars['number_of_photos'].astype(int).sum()
# Note: .str methods return a regular (non-categorical) Series
used_cars["color"].str.contains("red")
'''
output:
0        False
1        False
...
'''

## Label encoding
### What is label encoding?
#### The basics:
- Codes each category as an integer from 0 through n - 1, where n is the number of categories
- A -1 code is reserved for any missing values
- Can save on memory
- Often used in surveys

#### The drawback:
- Is not the best encoding method for machine learning (see next lesson)
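The basics above, including the reserved -1 code, can be seen on a tiny made-up Series. Building the code book from `cat.categories` is one possible approach, shown here as an illustration.

```python
import numpy as np
import pandas as pd

maker = pd.Series(["Subaru", "Dodge", np.nan, "Subaru", "Acura"], dtype="category")

codes = maker.cat.codes                            # integers 0..n-1, with -1 for missing
code_book = dict(enumerate(maker.cat.categories))  # code -> category name

print(codes.tolist())
print(code_book)

# Mapping the codes back: -1 has no entry in the code book, so it becomes NaN again.
decoded = codes.map(code_book)
print(decoded.tolist())
```

Categories are sorted alphabetically by default, so "Acura" receives code 0 even though it appears last in the data.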


In [None]:
# Creating codes
used_cars['manufacturer_name'] = used_cars['manufacturer_name'].astype("category")
# Using .cat.codes we can get a label encoding, which will convert the values to integers
used_cars['manufacturer_code'] = used_cars['manufacturer_name'].cat.codes

# Creating a code book
codes = used_cars['manufacturer_name'].cat.codes
categories = used_cars['manufacturer_name']

name_map = dict(zip(codes, categories))
print(name_map)
'''
output:
{45: 'Subaru', 
 24: 'LADA', 
 12: 'Dodge', 
 ...}
'''

In [None]:
# Using a code book

# Creating the codes:
used_cars['manufacturer_code'] = used_cars['manufacturer_name'].cat.codes

# Reverting to previous values:
used_cars['manufacturer_code'].map(name_map) # it's similar to .replace(), and it will replace the Series values based on the keys of the 'name_map' and their corresponding values.
'''
output:
0        Acura
1        Acura
2        Acura
...
'''

In [None]:
# Boolean coding

# Find all body types that have "van" in them:
used_cars["body_type"].str.contains("van", regex=False)

# Create a boolean coding:
used_cars["van_code"] = np.where(
    used_cars["body_type"].str.contains("van", regex=False), 1, 0)
# we use the np.where to say anytime this statement is true, we want to have a 1 value, and anytime this statement is false we want to have a 0.
used_cars["van_code"].value_counts()
'''
output:
0    34115
1     4416
Name: van_code, dtype: int64
'''

## One-hot encoding
### pd.get_dummies()
- data: a pandas DataFrame
- columns: a list-like object of column names
- prefix: a string to add to the beginning of each category
### A few quick notes
- Might create too many features
- **NaN** values do not get their own column
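The note about **NaN** values can be checked on a small made-up DataFrame. The `dummy_na=True` option of `pd.get_dummies` is shown as one way to keep track of missing values; the `cars` data is invented.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "odometer_value": [1000, 2500, 300, 42000],
    "color": ["red", "blue", np.nan, "red"],
})

# Default behaviour: the row with a missing color gets no indicator set.
onehot = pd.get_dummies(cars, columns=["color"], prefix="color")
print(onehot.columns.tolist())

# dummy_na=True adds an explicit indicator column for missing values.
onehot_na = pd.get_dummies(cars, columns=["color"], prefix="color", dummy_na=True)
print(onehot_na.columns.tolist())
```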


In [None]:
# One-hot encoding on a DataFrame
used_cars_onehot = pd.get_dummies(used_cars[["odometer_value", "color"]])

# Specifying columns to use
used_cars_onehot = pd.get_dummies(used_cars, columns=["color"], prefix="")

# Introduction to Importing Data in Python

## Reading a text file /  Writing to a file / Context manager with

In [None]:
# Reading a text file
filename = 'huck_finn.txt'
file = open(filename, mode='r') # 'r' is to read
text = file.read()
file.close()
print(text)
'''
output:
YOU don't know about me without you have read a book by the name of The Adventures of Tom Sawyer;
but that ain't no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly.
There was things which he stretched, but mainly he told the truth.
That is nothing. I never seen anybody but lied one time or another,
without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt Polly,
she is--and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book,
with some stretchers, as I said before.
'''

In [None]:
# Writing to a file
filename = 'huck_finn.txt'
file = open(filename, mode='w') # 'w' is to write
# we use this to open a file for writing; note that mode 'w' erases any existing content of the file
file.close()

In [None]:
# Context manager with
# we use this so that we do not have to close the file ourselves: it is closed automatically when the block ends
with open('huck_finn.txt', 'r') as file:
    print(file.read())
# this allows us to create a context in which you can execute commands with the file open
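The write cell above opens and closes a file without writing anything. As a minimal sketch of a full round trip, the example below writes, appends and reads back inside a temporary directory (the `notes.txt` file name is hypothetical), so no real file is touched.

```python
import os
import tempfile

# Work inside a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

    # mode='w' creates the file, or truncates it if it already exists.
    with open(path, mode="w", encoding="utf-8") as file:
        file.write("first line\n")
        file.write("second line\n")

    # mode='a' appends instead of overwriting.
    with open(path, mode="a", encoding="utf-8") as file:
        file.write("third line\n")

    # mode='r' reads the content back.
    with open(path, mode="r", encoding="utf-8") as file:
        lines = file.read().splitlines()

    print(file.closed)  # the with block closed the file on exit

print(lines)
```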

## Importing flat files using NumPy

In [None]:
# Customizing your NumPy import
import numpy as np
filename = 'MNIST_header.txt'
data = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=[0, 2])
# the default 'delimiter' is any whitespace, so we'll usually need to specify it explicitly
# if your data consists of numerics and your header has strings in it, you will want to skip the first row by passing the argument 'skiprows=1'
# if you want to select specific columns, just use 'usecols=[]'
print(data)
'''
output:
[[   0.    0.] 
 [  86.  254.] 
 [   0.    0.] 
...,  
 [   0.    0.] 
 [   0.    0.] 
 [   0.    0.]]
'''

In [None]:
# Note: we can also import different datatypes into NumPy arrays with the 'dtype' argument
data = np.loadtxt(filename, delimiter=',', dtype=str)
# 'loadtxt' is great for basic cases, but tends to break down when we have mixed datatypes, for example, the Titanic dataset

# Handling mixed datatypes
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
# np.recfromcsv() behaves like genfromtxt with these settings as defaults, but it is deprecated in NumPy 2.0
d = np.recfromcsv('titanic.csv')

## Importing flat files using pandas
### What a data scientist needs:
- Two-dimensional labeled data structure(s)
- Columns of potentially different types
- Manipulate, slice, reshape, groupby, join, merge
- Perform statistics
- Work with time series data


In [None]:
# Importing using pandas
import pandas as pd
filename = 'winequality-red.csv'
data = pd.read_csv(filename)
data.head()

## Introduction to other file types
### Pickled files
- File type native to Python
- Motivation: there are many datatypes for which it isn't obvious how to store them
- Pickled files are serialized
- Serialize = convert object to bytestream
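Serialization can be seen directly with `pickle.dumps` and `pickle.loads`, which work on bytes in memory instead of a file. The dictionary below is just an illustrative object.

```python
import pickle

fruit = {"peaches": 13, "apples": 4, "oranges": 11}

payload = pickle.dumps(fruit)     # serialize: Python object -> bytes
restored = pickle.loads(payload)  # deserialize: bytes -> a new, equal object

print(type(payload).__name__)
print(restored == fruit, restored is fruit)
```

The restored dictionary is equal to the original but is a separate object. Since unpickling can execute arbitrary code, only load pickled files from sources you trust.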


In [None]:
# Pickled files
import pickle
with open('pickled_fruit.pkl', 'rb') as file:
    data = pickle.load(file)    
# to specify both read-only and binary, pass the string 'rb' as the second argument of open
print(data)
'''
output: {'peaches': 13, 'apples': 4, 'oranges': 11}
'''

# Importing Excel spreadsheets
import pandas as pd
file = 'urbanpop.xlsx'
data = pd.ExcelFile(file)
print(data.sheet_names)
'''
output: ['1960-1966', '1967-1974', '1975-2011']
'''
df1 = data.parse('1960-1966') # sheet name, as a string
df2 = data.parse(0) # sheet index, as an integer


### SAS and Stata files
- SAS: Statistical Analysis System
- Stata: “Statistics” + “data”
- SAS: business analytics and biostatistics
- Stata: academic social sciences research

In [None]:
# Importing SAS files
import pandas as pd
from sas7bdat import SAS7BDAT
with SAS7BDAT('urbanpop.sas7bdat') as file:    
    df_sas = file.to_data_frame()

# Importing Stata files
import pandas as pd
data = pd.read_stata('urbanpop.dta')


### HDF5 files
- Hierarchical Data Format version 5
- Standard for storing large quantities of numerical data
- Datasets can be hundreds of gigabytes or terabytes
- HDF5 can scale to exabytes

In [None]:
# Importing HDF5 files
import h5py
filename = 'H-H1_LOSC_4_V1-815411200-4096.hdf5'
data = h5py.File(filename, 'r') # 'r' is to read

### MATLAB files
- “Matrix Laboratory”
- Industry standard in engineering and science
- Data saved as .mat files


### SciPy to the rescue!
- scipy.io.loadmat() - read .mat files
- scipy.io.savemat() - write .mat files


In [None]:
# Importing a .mat file
import scipy.io
filename = 'workspace.mat'
mat = scipy.io.loadmat(filename)

## Working with relational databases

In [None]:
# Creating a database engine
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Northwind.sqlite')

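# Getting table names
# (engine.table_names() was removed in SQLAlchemy 2.0; inspect() is the replacement)
from sqlalchemy import inspect
table_names = inspect(engine).get_table_names()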
print(table_names)
'''
output:
['Categories', 'Customers', 'EmployeeTerritories',
'Employees', 'Order Details', 'Orders', 'Products',
'Region', 'Shippers', 'Suppliers', 'Territories']
'''

In [None]:
# Querying relational databases
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine('sqlite:///Northwind.sqlite')
con = engine.connect()
# SQLAlchemy 2.0 requires raw SQL strings to be wrapped in text()
rs = con.execute(text("SELECT * FROM Orders"))
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()

con.close()

# Using the context manager
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine('sqlite:///Northwind.sqlite')

with engine.connect() as con:
    # text() wraps the raw SQL string, as SQLAlchemy 2.0 requires
    rs = con.execute(text("SELECT OrderID, OrderDate, ShipName FROM Orders"))
    df = pd.DataFrame(rs.fetchmany(size=5))    
    df.columns = rs.keys()
# we have another way to do that in just one line
df = pd.read_sql_query("SELECT * FROM Orders", engine)

# More functionalities
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')

df = pd.read_sql_query("SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate", engine)

In [None]:
# JOINing tables
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///Northwind.sqlite')

df = pd.read_sql_query("SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID", engine)

print(df.head())

# Intermediate Importing Data in Python

## Importing data from the internet

In [None]:
# Automate file download in Python
from urllib.request import urlretrieve
import pandas as pd
url = 'https://assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

urlretrieve(url, 'winequality-red.csv')

df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())

# Read in all sheets of Excel file
url = 'https://assets.datacamp.com/course/importing_data_into_r/latitude.xls'
xls = pd.read_excel(url, sheet_name=None)
print(xls.keys())
print(xls['1700'].head())

In [None]:
# Import flat files from the web
from urllib.request import urlopen, Request
url = "https://www.wikipedia.org/"
request = Request(url)
response = urlopen(request)
html = response.read()
response.close()

# GET requests using requests
import requests
url = "https://www.wikipedia.org/"
r = requests.get(url)
text = r.text


### HTML
Mix of unstructured and structured data

Structured data:
- Has pre-defined data model, or
- Organized in a defined manner

Unstructured data:
- Neither of these properties

In [None]:
# BeautifulSoup
from bs4 import BeautifulSoup
import requests
url = 'https://www.crummy.com/software/BeautifulSoup/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.prettify())

In [None]:
# Exploring BeautifulSoup
print(soup.title)

print(soup.get_text())

for link in soup.find_all('a'):
    print(link.get('href'))

## Introduction to APIs and JSONs

### API
1. Application Programming Interface
2. Set of protocols and routines
- Building and interacting with software applications
3. Bunch of code
- Allows two software programs to communicate with each other

Obs: A standard form for transferring data through APIs is the JSON file format

### JSON
- JavaScript Object Notation
- Real-time server-to-browser communication
- Human readable
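
To make the "human readable" point concrete, here is a small sketch that converts a Python dictionary to JSON text and back using the standard `json` module. The `snake` record is invented for illustration.

```python
import json

# Hypothetical record (illustration only)
snake = {'name': 'python', 'venomous': False, 'length_m': 3.5}

# Python object -> JSON text
text = json.dumps(snake, indent=2)
print(text)

# JSON text -> Python object
parsed = json.loads(text)
print(parsed['name'])
```

Note how `False` becomes `false` in the JSON text and is turned back into `False` when parsed.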

In [None]:
# Loading JSONs in Python
import json
with open('snakes.json', 'r') as json_file:
    json_data = json.load(json_file)
    
for key, value in json_data.items():
    print(f"{key}:{value}")

In [None]:
# Connecting to an API in Python
import requests
url = 'http://www.omdbapi.com/?t=hackers'
r = requests.get(url)
json_data = r.json()

for key, value in json_data.items():
    print(f'{key}: {value}')

## Using Tweepy: Authentication

In [None]:
# Using Tweepy: Authentication handler
import tweepy, json
access_token = "..."
access_token_secret = "..."
consumer_key = "..."
consumer_secret = "..."
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

In [None]:
# Tweepy: define stream listener class
# (tweepy.StreamListener exists in Tweepy 3.x; Tweepy 4 merged it into tweepy.Stream)
class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super(MyStreamListener, self).__init__()
        self.num_tweets = 0
        self.file = open("tweets.txt", "w")
    def on_status(self, status):
        tweet = status._json
        # Write one JSON document per line
        self.file.write(json.dumps(tweet) + '\n')
        self.num_tweets += 1
        if self.num_tweets < 100:
            return True
        # Close the file before stopping the stream
        self.file.close()
        return False

In [None]:
# Using Tweepy: stream tweets!

# Create Streaming object and authenticate
l = MyStreamListener() 
stream = tweepy.Stream(auth, l)

# This line filters Twitter Streams to capture data by keywords:
stream.filter(track=['apples', 'oranges'])

In [None]:
# Create Streaming object
stream = tweepy.Stream(consumer_key, consumer_secret, 
                       access_token, access_token_secret)


# Cleaning Data in Python 

### Data type constraints

In [None]:
# String to integers
sales.dtypes
'''
output:
SalesOrderID    int64
Revenue         object
Quantity        int64
'''
# Print sum of all Revenue column
sales['Revenue'].sum()
'''
output:
'23153$1457$36865$32474$472$27510$16158$5694$6876$40487$807$6893$9153$6895$4216..
'''
# Remove $ from Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')

# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int'

In [None]:
# The assert statement
# This will pass
assert 1 + 1 == 2

# This will not pass (error)
assert 1 + 1 == 3
'''
output:
AssertionError
'''

## Data range constraints

In [None]:
# Convert avg_rating > 5 to 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5

# Assert statement
assert movies['avg_rating'].max() <= 5

In [None]:
# Date range example
import datetime as dt
import pandas as pd
# Output data types
user_signups.dtypes
'''
output:
subscription_date    object
user_name            object
Country              object
dtype: object
'''
# Convert to date
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date']).dt.date


In [None]:
# Date range example
today_date = dt.date.today()

# Option 1: drop the data
# Drop values using filtering
user_signups = user_signups[user_signups['subscription_date'] <= today_date]
# Drop values using .drop()
user_signups.drop(user_signups[user_signups['subscription_date'] > today_date].index, inplace = True)

# Option 2: hardcode dates with upper limit
# Replace values above the limit with today's date
user_signups.loc[user_signups['subscription_date'] > today_date, 'subscription_date'] = today_date
# Assert is true (the column holds date objects, so no .date() call is needed)
assert user_signups['subscription_date'].max() <= today_date

## Uniqueness constraints
The **.duplicated()** method
- **subset**: List of column names to check for duplication. 
- **keep**: Whether to keep first('first'), last('last') or all(False) duplicate values.

In [None]:
# How to find duplicate values?

# Column names to check for duplication
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)

# Get duplicates across all columns
duplicates = height_weight.duplicated()

# Get duplicate rows
duplicates = height_weight.duplicated()
height_weight[duplicates]

The **.drop_duplicates()** method
- **subset**: List of column names to check for duplication. 
- **keep**: Whether to keep first('first'), last('last') or all(False) duplicate values.
- **inplace**: Drop duplicated rows directly inside DataFrame without creating new object(True).

In [None]:
# How to find duplicate rows?

# Output duplicate values
height_weight[duplicates].sort_values(by = 'first_name')

# Drop duplicates
height_weight.drop_duplicates(inplace = True)

In [None]:
# How to treat duplicate values?

# Group by column names and produce statistical summaries
column_names = ['first_name','last_name','address']
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = height_weight.groupby(column_names).agg(summaries).reset_index()

# Make sure aggregation is done
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')

## Text and categorical data problems
- Categorical data represents variables that take values from a predefined, finite set of categories

### How do we treat these problems?
- Dropping data with incorrect categories
- Remapping incorrect categories to correct ones 
- Inferring categories

In [None]:
# Finding inconsistent categories
'''
The problem: our DataFrame contains some impossible blood types, such as Z+,
so we need to find them and decide on the best way to clean them
'''
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)
'''
output: {'Z+'}
'''
# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]      
'''
output: name   birthday blood_type
    5 Jennifer 2019-12-17        Z+
'''

## What type of errors could we have?
### I) Value inconsistency 
- Inconsistent fields: **'married'**, **'Maried'**, **'UNMARRIED'**, **'not married'**...
- Trailing white spaces: **'married '**, **' married '**...

### II) Collapsing too many categories into a few
- Creating new groups: **0-20K**, **20-40K** categories... from continuous household income data
- Mapping groups to new ones: Mapping household income categories to 2 **'rich'**, **'poor'** 

### III) Making sure data is of type category


In [None]:
# Value consistency

# Capitalization: 'married', 'Married', 'UNMARRIED', 'unmarried'.
# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()
'''
output:
unmarried    352
married      268
MARRIED      204
UNMARRIED    176
dtype: int64
'''
# Capitalize (marriage_status is a Series, so assign back to the DataFrame column)
demographics['marriage_status'] = demographics['marriage_status'].str.upper()
demographics['marriage_status'].value_counts()
# Lowercase
demographics['marriage_status'] = demographics['marriage_status'].str.lower()
demographics['marriage_status'].value_counts()


# Trailing spaces: 'married ', 'married', 'unmarried', ' unmarried'.
# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts() 
'''
output:
 unmarried   352
unmarried    268
married      204
married      176
dtype: int64
'''
# Strip leading and trailing spaces
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
demographics['marriage_status'].value_counts()
'''
output:
unmarried    528
married      472
'''

In [None]:
# Collapsing data into categories

# Create categories out of data: income_group column from income column

# Using qcut()
import pandas as pd
group_names = ['0-200K', '200K-500K', '500K+']
demographics['income_group'] = pd.qcut(demographics['household_income'], q = 3, labels = group_names)

# Print income_group column
demographics[['income_group', 'household_income']]


# Using cut() - create category ranges and names
import numpy as np
ranges = [0, 200000, 500000, np.inf]
group_names = ['0-200K', '200K-500K', '500K+']

# Create income group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins=ranges, labels=group_names)

# Print income_group column
demographics[['income_group', 'household_income']]


# Map categories to fewer ones: reducing categories in a categorical column.
'''
operating_system column is: 'Microsoft', 'MacOS', 'IOS', 'Android', 'Linux'
operating_system column should become: 'DesktopOS', 'MobileOS'
'''
# Create mapping dictionary and replace
mapping = {'Microsoft':'DesktopOS', 'MacOS':'DesktopOS', 'Linux':'DesktopOS','IOS':'MobileOS', 'Android':'MobileOS'}
devices['operating_system'] = devices['operating_system'].replace(mapping)
devices['operating_system'].unique()

## Cleaning text data
### Common text data problems
1. Data inconsistency: +96171679912 or 0096171679912 or..?
2. Fixed length violations: Passwords need to be at least 8 characters 
3. Typos: +961.71.679912

In [None]:
# Fixing the phone number column

# Replace "+" with "00" (regex=False treats "+" as a literal character)
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)

# Replace "-" with nothing
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)

# Replace phone numbers with fewer than 10 digits with NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan


# Find length of each row in Phone number column
sanity_check = phones['Phone number'].str.len()

# Assert minimum phone number length is 10
assert sanity_check.min() >= 10

# Assert no number contains "+" or "-" ("+" must be escaped in a regular expression)
assert phones['Phone number'].str.contains(r"\+|-").any() == False

In [None]:
# Regular expressions in action

# Replace every non-digit character with nothing (regex=True is needed in pandas 2.0+)
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '', regex=True)
phones.head()

## Uniformity, Cross field validation, Completeness

### Treating ambiguous date data 
**Is 2019-03-08 in August or March?**
- Convert to NA and treat accordingly
- Infer format by understanding data source
- Infer format by understanding previous and subsequent data in DataFrame
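
A short sketch of why the string is ambiguous, and how a neighbouring value can settle the format. The `parses` helper and the `neighbour` value are assumptions introduced for this example.

```python
from datetime import datetime

raw = '2019-03-08'

# The same text parses under both readings, so the string alone is ambiguous
as_ymd = datetime.strptime(raw, '%Y-%m-%d')  # 8 March
as_ydm = datetime.strptime(raw, '%Y-%d-%m')  # 3 August
print(as_ymd.month, as_ydm.month)

def parses(text, fmt):
    """Return True if text matches the datetime format fmt."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

# A neighbouring value with a day above 12 fits only one of the formats
neighbour = '2019-25-03'
print(parses(neighbour, '%Y-%m-%d'), parses(neighbour, '%Y-%d-%m'))
```

If other rows from the same source only parse as year-day-month, it is reasonable to read the ambiguous row the same way.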


### Cross field validation
**What to do when we catch inconsistencies?**
- Dropping Data
- Set to missing and impute
- Apply rules from domain knowledge

### Completeness
**How to deal with missing data?**
**Simple approaches:**
1. Drop missing data
2. Impute with statistical measures (mean, median, mode...)
**More complex approaches:**
1. Imputing using an algorithmic approach
2. Impute with machine learning models

**Missingness types**
1. Missing Completely at Random (MCAR):
- No systematic relationship between missing data and other values
- Data entry errors when inputting data
2. Missing at Random (MAR):
- Systematic relationship between missing data and other **observed** values
- Missing ozone data for high temperatures
3. Missing Not at Random (MNAR)
- Systematic relationship between missing data and **unobserved** values
- Missing temperature values for high temperatures

In [None]:
# Uniformity

# Treating temperature data
temp_fah = temperatures.loc[temperatures['Temperature'] > 40, 'Temperature']
temp_cels = (temp_fah - 32) * (5/9)
temperatures.loc[temperatures['Temperature'] > 40, 'Temperature'] = temp_cels

# Assert conversion is correct: no value should remain above 40
assert temperatures['Temperature'].max() < 40


# Treating date data
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'],
                        # Attempt to infer format of each date
                        infer_datetime_format=True, 
                        # Return NA for rows where conversion failed
                        errors = 'coerce')

# Another way of treating date data
birthdays['Birthday'] = birthdays['Birthday'].dt.strftime("%d-%m-%Y")

In [None]:
# Cross field validation

# Checking data integrity
fund_class=['economy_class', 'business_class', 'first_class']
sum_classes = flights[fund_class].sum(axis = 1)
passenger_equ = sum_classes == flights['total_passengers']

# Find and filter out rows with inconsistent passenger totals
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]


import pandas as pd
import datetime as dt
# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])
today = dt.date.today()

# For each row in the Birthday column, calculate year difference
age_manual = today.year - users['Birthday'].dt.year

# Find instances where ages match
age_equ = age_manual == users['Age']

# Find and filter out rows with inconsistent age
inconsistent_age = users[~age_equ]
consistent_age = users[age_equ]

In [None]:
# Completeness

# Visualizing and understanding missing data
import missingno as msno
import matplotlib.pyplot as plt

# Visualize missingness
msno.matrix(airquality)
plt.show()


# Dealing with missing data

# Dropping missing values
airquality_dropped = airquality.dropna(subset = ['CO2'])

# Replacing with statistical measures
co2_mean = airquality['CO2'].mean()
airquality_imputed = airquality.fillna({'CO2': co2_mean})

## Record linkage

### Comparing strings
**Minimum edit distance**
- Least possible amount of steps needed to transition from one string to another
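
To show what is being counted, here is a plain-Python sketch of one common minimum edit distance, the Levenshtein distance, where each insertion, deletion, or substitution costs one step. It is an illustration only; thefuzz reports a similarity score from 0 to 100 rather than this raw step count.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

print(edit_distance('reeding', 'reading'))
print(edit_distance('kitten', 'sitting'))
```

'reeding' becomes 'reading' with a single substitution, so its distance is 1.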

In [None]:
# Simple string comparison

# Lets us compare between two strings
from thefuzz import fuzz

# Compare reeding vs reading
fuzz.WRatio('Reeding', 'Reading')

In [None]:
# Partial strings and different orderings

# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets')

# Partial string comparison with different order
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')

In [None]:
# Comparison with arrays

# Import process
from thefuzz import process

# Define string and array of possible matches
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets', 'Houson vs Los Angeles', 'Heat vs Bulls'])
process.extract(string, choices, limit = 2)

In [None]:
# Collapsing all of the states

# For each correct category
for state in categories['state']:
    # Find potential matches in states with typos
    matches = process.extract(state, survey['state'], limit = survey.shape[0])
    # For each potential match
    for potential_match in matches:
        # If high similarity score
        if potential_match[1] >= 80:
            # Replace typo with correct category          
            survey.loc[survey['state'] == potential_match[0], 'state'] = state

## Generating pairs

In [None]:
# Generating Pairs

# Import recordlinkage
import recordlinkage

# Create indexing object
indexer = recordlinkage.Index()

# Generate pairs blocked on state
indexer.block('state')
pairs = indexer.index(census_A, census_B)

In [None]:
# Comparing the DataFrames

# Generate the pairs
pairs = indexer.index(census_A, census_B)

# Create a Compare object
compare_cl = recordlinkage.Compare()

# Find exact matches for pairs of date_of_birth and state
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')

# Find similar matches for pairs of surname and address_1 using string similarity
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

# Find matches
potential_matches = compare_cl.compute(pairs, census_A, census_B)

## Linking DataFrames

In [None]:
# Probable matches
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get the indices
matches.index
'''
output:
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org', 'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', ...
'''
# Get indices from census_B only
duplicate_rows = matches.index.get_level_values(1)
print(duplicate_rows)
'''
output:
Index(['rec-2404-dup-0', 'rec-4178-dup-0', 'rec-1054-dup-0', 'rec-4663-dup-0',       'rec-485-dup-0', 'rec-2950-dup-0', 'rec-1234-dup-0', ... , 'rec-299-dup-0'])
'''

# Linking DataFrames
# Finding duplicates in census_B
census_B_duplicates = census_B[census_B.index.isin(duplicate_rows)]

# Finding new rows in census_B
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Link the DataFrames! (DataFrame.append was removed in pandas 2.0; use pd.concat)
full_census = pd.concat([census_A, census_B_new])


# Import recordlinkage and generate pairs and compare across columns...

# Generate potential matches
potential_matches = compare_cl.compute(full_pairs, census_A, census_B)

# Isolate matches with matching values for 3 or more columns
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get index for matching census_B rows only
duplicate_rows = matches.index.get_level_values(1)

# Finding new rows in census_B
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Link the DataFrames! (DataFrame.append was removed in pandas 2.0; use pd.concat)
full_census = pd.concat([census_A, census_B_new])


# Working with Dates and Times in Python

### Weekdays in Python
- 0 = Monday
- 1 = Tuesday
- 2 = Wednesday
- 3 = Thursday
- 4 = Friday
- 5 = Saturday
- 6 = Sunday

In [None]:
# Dates in Python

# Import date
from datetime import date
# Create dates
two_hurricanes_dates = [date(2016, 10, 7), date(2017, 6, 21)]

# Attributes of a date
print(two_hurricanes_dates[0].year)
print(two_hurricanes_dates[0].month)
print(two_hurricanes_dates[0].day)

print(two_hurricanes_dates[0].weekday())

In [None]:
# Turning dates into strings

from datetime import date

d = date(2017, 11, 5) # Example date
# ISO format: YYYY-MM-DD
print(d)

# Express the date in ISO 8601 format and put it in a list
print([d.isoformat()])

# A few dates that computers once had trouble with
some_dates = ['2000-01-01', '1999-12-31']
# Print them in order
print(sorted(some_dates))
'''
output: ['1999-12-31', '2000-01-01']
'''


In [None]:
# Other forms to create a date

d = date(2017, 1, 5)

print(d.strftime("Year is %Y")) # fills the year in this string

### Dates and Times
How to represent this in python:
October 1 2017, 3:23:25 PM
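
When the moment arrives as text in exactly this form, one option is to parse it with `datetime.strptime`. This is a sketch that assumes an English locale for the month name and the AM/PM marker.

```python
from datetime import datetime

text = 'October 1 2017, 3:23:25 PM'

# %B = full month name, %I = hour on a 12-hour clock, %p = AM/PM
parsed = datetime.strptime(text, '%B %d %Y, %I:%M:%S %p')
print(parsed)
print(parsed.hour)
```

The 12-hour "3 PM" is stored as hour 15, which matches the values passed to `datetime()` directly.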

In [None]:
# Import datetime
from datetime import datetime
dt = datetime(2017, 10, 1, 15, 23, 25)
# or being more explicit 
dt = datetime(year=2017, month=10, day=1, 
              hour=15, minute=23, second=25, 
              microsecond=500000)
print(dt)

# Replacing parts of a datetime
dt_hr = dt.replace(minute=0, second=0, microsecond=0)
print(dt_hr)

# Printing datetimes
print(dt.strftime("%Y-%m-%d %H:%M:%S"))

# A timestamp
ts = 1514665153.0
# Convert to datetime and print
print(datetime.fromtimestamp(ts))

### Time Zones and Daylight Saving

In [None]:
# UTC

# Import relevant classes
from datetime import datetime, timedelta, timezone
# US Eastern Standard time zone
ET = timezone(timedelta(hours=-5))
# Timezone-aware datetime
dt = datetime(2017, 12, 30, 15, 9, 3, tzinfo = ET)
print(dt)


# India Standard time zone
IST = timezone(timedelta(hours=5, minutes=30))
# Convert to IST
print(dt.astimezone(IST))

In [None]:
# Time zone database

# Imports
from datetime import datetime
from dateutil import tz

# Eastern time
et = tz.gettz('America/New_York')

# Ending Daylight Saving Time
eastern = tz.gettz('US/Eastern')
# 2017-11-05 01:00:00
first_1am = datetime(2017, 11, 5, 1, 0, 0, tzinfo=eastern)
tz.datetime_ambiguous(first_1am)
# 2017-11-05 01:00:00 again
second_1am = datetime(2017, 11, 5, 1, 0, 0, tzinfo=eastern)
second_1am = tz.enfold(second_1am)

### Reading date and time data in Pandas

In [None]:
# Loading datetimes with parse_dates

# Import W20529's rides in Q4 2017
rides = pd.read_csv('capital-onebike.csv', 
                    parse_dates = ['Start date', 'End date'])
# Or: 
rides['Start date'] = pd.to_datetime(rides['Start date'], 
                                     format = "%Y-%m-%d %H:%M:%S")

In [None]:
# Timezone aware arithmetic

# Create a duration column
rides['Duration'] = rides['End date'] - rides['Start date']

# Converting our 'Duration' column to seconds
rides['Duration']\
    .dt.total_seconds()\
    .head(5)

In [None]:
# Summarizing data in Pandas
from datetime import timedelta

# Percent of time out of the dock
rides['Duration'].sum() / timedelta(days=91)

# Percent of rides by member
rides['Member type'].value_counts() / len(rides)

# Add duration (in seconds) column
rides['Duration seconds'] = rides['Duration'].dt.total_seconds()
# Average duration per member type
rides.groupby('Member type')['Duration seconds'].mean()

# Average duration by month
rides.resample('M', on = 'Start date')['Duration seconds'].mean()

# Size per group
rides.groupby('Member type').size()

# First ride per group
rides.groupby('Member type').first()

In [None]:
# Try to set a timezone
rides['Start date'] = rides['Start date']\
    .dt.tz_localize('America/New_York', ambiguous='NaT')
# without the 'ambiguous' argument, ambiguous times raise an error

# Shift the indexes forward one, padding with NaT
rides['End date'].shift(1).head(3)

# Writing functions in Python

In [None]:
# Docstrings
def the_answer():
    '''Return the answer to life,
    the universe, and everything
    
    Returns:
        int
    '''
    return 42

# If I want to access the docstring of a function
import inspect
print(inspect.getdoc(the_answer))

In [None]:
# Using context managers
with <context-manager>(<args>) as <variable-name>:
    # Run your code here
    # This code is running "inside the context"

# This code runs after the context is removed

###  The open() function does three things
- Sets up a context by opening a file
- Lets you run any code you want on that file
- Removes the context by closing the file

In [None]:
# A real-world example
with open('my_file.txt') as my_file:  
    text = my_file.read()  
    length = len(text)

print('The file is {} characters long'.format(length))

### Two ways to define a context manager
- Class-based
- Function-based
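
These notes concentrate on the function-based form. For comparison, a class-based context manager implements `__enter__` and `__exit__`; the `Greeting` class below is a hypothetical sketch, not part of the course material.

```python
class Greeting:
    """Class-based context manager (illustrative example)."""

    def __enter__(self):
        # Setup code runs when the with block is entered
        print('hello')
        # The return value is bound to the name after `as`
        return 42

    def __exit__(self, exc_type, exc_value, traceback):
        # Teardown code runs when the block exits, even after an exception
        print('goodbye')
        # Returning False lets any exception propagate
        return False

with Greeting() as foo:
    print('foo is {}'.format(foo))
```

`__enter__` plays the role of the code before `yield` in a function-based context manager, and `__exit__` the code after it.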

In [None]:
# Writing context managers
import contextlib

@contextlib.contextmanager
def my_context():
    # Add any set up code you need
    yield
    # Add any teardown code you need

### How to create a context manager
1. Define a function
2. (optional) Add any set up code your context needs
3. Use the "yield" keyword
4. (optional) Add any teardown code your context needs
5. Add the 'contextlib.contextmanager' decorator

In [None]:
# The "yield" keyword
import contextlib

@contextlib.contextmanager
def my_context():
    print('hello')
    yield 42
    # yield hands a value to the with block now; the rest of the
    # function runs later, when the with block finishes.
    print('goodbye')

In [None]:
with my_context() as foo:  
    print('foo is {}'.format(foo))

In [None]:
# Setup and teardown
@contextlib.contextmanager
def database(url):
    # set up database connection  
    db = postgres.connect(url)
    
    yield db
    
    # tear down database connection  
    db.disconnect()

In [None]:
# Yielding a value or None
url = 'http://datacamp.com/data'
with database(url) as my_db:  
    course_list = my_db.execute('SELECT * FROM courses')

In [None]:
'''
Changes the current working directory
to a specific path and then changes it 
back after the context block is done
'''
import contextlib
import os

@contextlib.contextmanager
def in_dir(path):
    # save current working directory
    old_dir = os.getcwd()
    
    # switch to new working directory
    os.chdir(path)
    
    yield
    # change back to previous
    # working directory
    os.chdir(old_dir)

In [None]:
with in_dir('/data/project_1/'):
    project_files = os.listdir()

## Advanced topics

In [None]:
# Nested contexts
def copy(src, dst):
    """
    Copy the contents of one file to another.
    
    Args:
        src (str): File name of the file to be copied.
        dst (str): Where to write the new file.
    """
    # Open both files
    with open(src) as f_src:
        with open(dst, 'w') as f_dst:
            # Read and write each line, one at a time
            for line in f_src:
                f_dst.write(line)

In [None]:
# Handling errors
import contextlib

@contextlib.contextmanager
def get_printer(ip):
    p = connect_to_printer(ip)
    
    try:
        yield
    
    # This MUST be called or no one else will
    # be able to connect to the printer
    finally:
        p.disconnect()
        print('disconnected from printer')
    
doc = {'text': 'This is my text.'}

### Context manager patterns
- Open / Close
- Lock / Release
- Change / Reset
- Enter / Exit
- Start / Stop
- Setup / Teardown
- Connect / Disconnect
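
As one illustration of these patterns, the sketch below applies Start / Stop to a stopwatch. The name `stopwatch` and the `timing` dictionary are assumptions made for this example.

```python
import contextlib
import time

@contextlib.contextmanager
def stopwatch(label):
    """Start / Stop pattern: time the code that runs inside the with block."""
    timing = {'label': label, 'seconds': None}
    start = time.perf_counter()  # Start
    try:
        yield timing
    finally:
        # Stop: this runs even if the block raised an exception
        timing['seconds'] = time.perf_counter() - start
        print('{} took {:.3f}s'.format(label, timing['seconds']))

with stopwatch('sleeping') as timing:
    time.sleep(0.1)
```

The `try` / `finally` pair guarantees the "stop" half of the pattern, which is the same reason the printer example uses it to disconnect.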

## Decorators

In [None]:
# Functions as objects

# Everything here is an object
def x():
    pass
x = [1, 2, 3]
x = {'foo': 42}
x = pandas.DataFrame()
x = 'This is a sentence.'
x = 3
x = 71.2
import x

In [None]:
# Function as variables
def my_function():  
    print('Hello')
x = my_function
print(type(x))

PrintyMcPrintface = print
PrintyMcPrintface('Python is awesome!')
'''
output: Python is awesome!
'''

In [None]:
# Lists and dictionaries of functions
list_of_functions = [my_function, open, print]
list_of_functions[2]('I am printing with an element of a list!')

dict_of_functions = {
    'func1': my_function,
    'func2': open,
    'func3': print
}
dict_of_functions['func3']('I am printing with a value of a dict!')

In [None]:
# Functions as arguments
def has_docstring(func):
    """Check to see if the function
    `func` has a docstring.
    
    Args:
        func (callable): A function.
    Returns:
        bool
    """
    return func.__doc__ is not None

# Examples of functions as arguments
def no():
    return 42

def yes():
    """Return the value 42  """
    return 42
    
print(has_docstring(no))
print(has_docstring(yes))

## Closures

In [None]:
# Attaching non local variables to nested functions
def foo():
    a = 5
    def bar():
        print(a)
    return bar

func = foo()

func()

In [None]:
# Closures!
print(type(func.__closure__))
print(len(func.__closure__))
print(func.__closure__[0].cell_contents)

**Nested function:** A function defined inside another function

In [None]:
# Outer function
def parent():
    # nested function
    def child():
        pass
    return child

**Nonlocal variables**: Variables defined in the parent function that are used by the child function

In [None]:
def parent(arg_1, arg_2):
    # From child()'s point of view,
    # `value` and `my_dict` are nonlocal variables,
    # as are `arg_1` and `arg_2`.
    value = 22
    my_dict = {'chocolate': 'yummy'}
    
    def child():
        print(2 * value)
        print(my_dict['chocolate'])
        print(arg_1 + arg_2)
        
    return child

**Closure**: Nonlocal variables attached to a returned function

In [None]:
def parent(arg_1, arg_2):
    value = 22
    my_dict = {'chocolate': 'yummy'}

    def child():
        print(2 * value)
        print(my_dict['chocolate'])
        print(arg_1 + arg_2)
    
    return child

new_function = parent(3, 4)

print([cell.cell_contents for cell in new_function.__closure__])

In [None]:
# Finally, we'll talk about Decorators

# What does a decorator look like?
@double_args
def multiply(a, b):
    return a * b
multiply(1, 5)

In [None]:
# The double_args decorator (an example of how decorators work)
def multiply(a, b):
    return a * b
def double_args(func):
    def wrapper(a, b):
        return func(a * 2, b * 2)
    return wrapper 
# Reassigning the name is what the @double_args syntax does for us
multiply = double_args(multiply)
multiply(1, 5)

## More on decorators

In [None]:
# Time a function
import time

def timer(func):
    """A decorator that prints how long a function took to run."""
    # Define the wrapper function to return.
    def wrapper(*args, **kwargs):
        # When wrapper() is called, get the current time.
        t_start = time.time()
        # Call the decorated function and store the result.
        result = func(*args, **kwargs)
        # Get the total time it took to run, and print it.
        t_total = time.time() - t_start
        print('{} took {}s'.format(func.__name__, t_total))
        return result
    return wrapper

In [None]:
# Using timer()
@timer
def sleep_n_seconds(n):
    time.sleep(n)
    
sleep_n_seconds(10)

In [None]:
# Storing the results of a function
def memoize(func):
    """Store the results of the decorated function for fast lookup"""
    # Store results in a dict that maps arguments to results
    cache = {}
    # Define the wrapper function to return.
    def wrapper(*args, **kwargs):
        # kwargs is a dict (unhashable), so build a hashable cache key
        key = (args, tuple(sorted(kwargs.items())))
        # If these arguments haven't been seen before,
        if key not in cache:
            # Call func() and store the result.
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper

In [None]:
# Using memoize()
@memoize
def slow_function(a, b):
    print('Sleeping...')
    time.sleep(5)
    return a + b

slow_function(3, 4)
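
For comparison, the standard library ships a ready-made memoizing decorator, `functools.lru_cache`. The sketch below uses a hypothetical function `slow_add` and a `calls` list (both invented for illustration) to show that repeated arguments are served from the cache instead of re-running the function.

```python
from functools import lru_cache

calls = []  # records every real execution of slow_add


@lru_cache(maxsize=None)  # maxsize=None means the cache never evicts entries
def slow_add(a, b):
    """Hypothetical stand-in for an expensive computation."""
    calls.append((a, b))
    return a + b


first = slow_add(3, 4)   # computed
second = slow_add(3, 4)  # same arguments, served from the cache
third = slow_add(3, 5)   # new arguments, computed again
print(first, second, third, calls)
```

Like the hand-written version, `lru_cache` needs hashable arguments, because they are used as dictionary keys.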

### When to use decorators
- Add common behavior to multiple functions

In [None]:
# Decorators and metadata
from functools import wraps
def timer(func):
    """A decorator that prints how long a function took to run."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        t_start = time.time()
        result = func(*args, **kwargs)
        t_total = time.time() - t_start
        print('{} took {}s'.format(func.__name__, t_total))
        return result 
    return wrapper
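
To see what `wraps` changes, the sketch below decorates two hypothetical functions (`add` and `subtract`, invented for this example), one with a plain decorator and one with a decorator that uses `@wraps`, and then inspects their metadata.

```python
from functools import wraps


def plain_decorator(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def wrapping_decorator(func):
    @wraps(func)  # copy __name__, __doc__, etc. from func onto wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain_decorator
def add(a, b):
    """Return a + b."""
    return a + b


@wrapping_decorator
def subtract(a, b):
    """Return a - b."""
    return a - b


print(add.__name__, add.__doc__)            # metadata of the inner wrapper
print(subtract.__name__, subtract.__doc__)  # metadata of subtract itself

# wraps() also stores the undecorated function as __wrapped__
original_subtract = subtract.__wrapped__
```

Without `@wraps`, the decorated function reports the name and docstring of `wrapper`, which is rarely what you want when debugging or generating documentation.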

## Decorators that take arguments

In [None]:
def run_n_times(n):
    """Define and return a decorator"""
    def decorator(func):
        def wrapper(*args, **kwargs):
            for i in range(n):
                func(*args, **kwargs)
        return wrapper
    return decorator

# Option 1: build the decorator first, then apply it
run_three_times = run_n_times(3)

@run_three_times
def print_sum(a, b):
    print(a + b)

# Option 2: call the decorator factory directly in the @ line
@run_n_times(3)
def print_sum(a, b):
    print(a + b)

### Timeout - background info

In [None]:
import signal
def raise_timeout(*args, **kwargs):
    raise TimeoutError()
# When an "alarm" signal goes off, call raise_timeout()
signal.signal(signalnum=signal.SIGALRM, handler=raise_timeout)
# Set off an alarm in 5 seconds
signal.alarm(5)
# Cancel the alarm
signal.alarm(0)

### Timeout itself

In [None]:
def timeout(n_seconds):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Set an alarm for n seconds
            signal.alarm(n_seconds)
            try:
                # Call the decorated func
                return func(*args, **kwargs)
            finally:
                # Cancel alarm
                signal.alarm(0)
        return wrapper
    return decorator

# Introduction to Regression with statsmodels in Python

## What is regression?
1. Statistical models to explore the relationship between a response variable and some explanatory variables.
2. Given values of explanatory variables, you can predict the values of the response variable.

### Jargon
**Response variable (a.k.a. dependent variable):**
- The variable that you want to predict.
**Explanatory variables (a.k.a. independent variables):**
- The variables that explain how the response variable will change.

## Linear Regression and Logistic Regression
**Linear regression:**
- The response variable is numeric.
**Logistic regression:**
- The response variable is logical.
**Simple linear/logistic regression:**
- There is only one explanatory variable.

### Straight lines are defined by two things
**Intercept:**
- The y value at the point when x is zero.
**Slope:**
- The amount the y value increases if you increase x by one.
**Equation:**
- y = intercept + slope * x.
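
As a rough illustration of where the intercept and slope come from, the sketch below computes the least-squares line by hand in plain Python. The function name `fit_line` and the data are made up for this example; `ols` in statsmodels is what the notes actually use.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # made-up data lying exactly on y = 1 + 2 * x

intercept, slope = fit_line(xs, ys)
prediction = intercept + slope * 10  # apply the line equation at x = 10
print(intercept, slope, prediction)
```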

In [None]:
# Making predictions
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols  # ordinary least squares, a type of regression
# Running the model
mdl_mass_vs_length = ols("mass_g ~ length_cm", data=bream).fit()
print(mdl_mass_vs_length.params)

# Predicting inside a DataFrame
explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(20, 41)}
)
prediction_data = explanatory_data.assign(
    mass_g=mdl_mass_vs_length.predict(explanatory_data)
)
print(prediction_data)

# Extrapolating means making predictions outside the range of observed data
little_bream = pd.DataFrame({"length_cm": [10]})

pred_little_bream = little_bream.assign(
    mass_g=mdl_mass_vs_length.predict(little_bream)
)

print(pred_little_bream)

In [None]:
# Regression to the mean
'''
Response value = fitted value + residual
or
               = 'the stuff you explained' + 'the stuff you couldn't explain'
OBS: residuals exist due to problems in the model and fundamental randomness
'''
# Adding a regression line to a scatter plot
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure()

sns.regplot(x="father_height_cm",
            y="son_height_cm",
            data=father_son,
            ci=None,
            line_kws={"color": "black"})
plt.axline(xy1=(150, 150),
           slope=1,
           linewidth=2,
           color="green")
plt.axis("equal")

plt.show()

## Coefficient of determination (Quantifying model fit) 
- Sometimes called "r-squared" or "R-squared".
**The proportion of the variance in the response variable that is predictable from the explanatory variable.**
- 1 means a perfect fit 
- 0 means the worst possible fit.
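
A small sketch of the definition, computed by hand on made-up data: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` is invented for this example; in the notes the value comes from the fitted statsmodels object.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * x for x in xs]
    # Variation the model could not explain
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    # Total variation of the response around its mean
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_resid / ss_total


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]  # made-up data

r2 = r_squared(xs, ys)
print(round(r2, 3))
```

For this data the residual sum of squares is 2.4 and the total sum of squares is 6, so the model explains 60% of the variance.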

In [None]:
# Shows several performance metrics in its output
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()  # fit the linear model

print(mdl_bream.summary())  # full summary, including R-squared
# For just the coefficient of determination, use print(mdl_bream.rsquared)

### Residual Standard Error (RSE)
1. A "typical" difference between a prediction and an observed response.
2. It has the same unit as the response variable.
3. MSE = RSE².

In [None]:
# Calculate the mse
mse = mdl_bream.mse_resid
print('mse: ', mse)

# Calculate the RSE
rse = np.sqrt(mse)
print('rse: ', rse)

# Calculating the RSE from "scratch"
residuals_sq = mdl_bream.resid ** 2

resid_sum_of_sq = sum(residuals_sq)

deg_freedom = len(bream.index) - 2
'''
Degrees of freedom equals the number of observations
minus the number of model coefficients.
'''
rse = np.sqrt(resid_sum_of_sq / deg_freedom)

print('rse: ', rse)

# What's the difference between RSE and RMSE?
# RMSE divides by the number of observations instead of the degrees of freedom
residuals_sq = mdl_bream.resid ** 2

resid_sum_of_sq = sum(residuals_sq)

n_obs = len(bream.index)

rmse = np.sqrt(resid_sum_of_sq / n_obs)

print('rmse: ', rmse)

## Visualizing model fit

In [None]:
# Residual vs. fitted values
'''
A diagnostic plot of residuals versus fitted values
for the bream model.
'''
sns.residplot(x="length_cm", y="mass_g", data=bream, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

# Q-Q plot
from statsmodels.api import qqplot
qqplot(data=mdl_bream.resid, fit=True, line="45")

# Scale-location plot
'''
A scale-location plot shows the square root of the absolute
standardized residuals versus the fitted values.
'''
model_norm_residuals_bream = mdl_bream.get_influence().resid_studentized_internal
model_norm_residuals_abs_sqrt_bream = np.sqrt(np.abs(model_norm_residuals_bream))

sns.regplot(x=mdl_bream.fittedvalues, y=model_norm_residuals_abs_sqrt_bream,
            ci=None, lowess=True)

plt.xlabel("Fitted values")
plt.ylabel("Sqrt of abs val of stdized residuals")

### Outliers, leverage and influence

In [1]:
# .get_influence() and .summary_frame()
mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()
summary_roach = mdl_roach.get_influence().summary_frame()
roach["leverage"] = summary_roach["hat_diag"]

## Logistic Regression

In [None]:
from statsmodels.formula.api import logit
mdl_churn_vs_recency_logit = logit("has_churned ~ time_since_last_purchase",
                                   data=churn).fit()
print(mdl_churn_vs_recency_logit.params)

sns.regplot(x="time_since_last_purchase", 
            y="has_churned", data=churn, 
            ci=None, logistic=True)
# `intercept` and `slope` are assumed to come from a linear model fitted earlier (not shown)
plt.axline(xy1=(0, intercept), slope=slope, color="black")

plt.show()

### Predictions and odds ratios
**Odds ratio**
- The odds ratio is the probability of something happening divided by the probability that it doesn't happen.
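
A quick numeric sketch of the definition above, in plain Python. The helper name `odds` and the probabilities are chosen only for illustration.

```python
import math


def odds(probability):
    """Odds in favor of an event with the given probability (0 <= p < 1)."""
    return probability / (1 - probability)


# Odds below 1 mean the event is less likely than not; above 1, more likely.
odds_by_p = {p: odds(p) for p in (0.2, 0.5, 0.75)}

# On a log scale the odds are symmetric around zero (p = 0.5 gives log-odds 0).
log_odds_by_p = {p: math.log(o) for p, o in odds_by_p.items()}

print(odds_by_p)
print(log_odds_by_p)
```

A probability of 0.75 corresponds to odds of 3, meaning the event is three times as likely to happen as not.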

In [None]:
# Visualizing odds ratio
prediction_data["odds_ratio"] = (prediction_data["has_churned"] /
                                 (1 - prediction_data["has_churned"]))

sns.lineplot(x="time_since_last_purchase", y="odds_ratio", data=prediction_data)

plt.axhline(y=1, linestyle="dotted")
plt.yscale("log")

plt.show()

## Quantifying logistic regression fit

In [None]:
# Confusion matrix ('Implementation')
actual_response = churn["has_churned"]

predicted_response = np.round(mdl_recency.predict())

outcomes = pd.DataFrame({"actual_response": actual_response,
                         "predicted_response": predicted_response})

print(outcomes.value_counts(sort=False))

# Visualizing the confusion matrix
conf_matrix = mdl_recency.pred_table()

from statsmodels.graphics.mosaicplot import mosaic

mosaic(conf_matrix)
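
A confusion matrix is usually summarized with a few ratios. The sketch below computes accuracy, sensitivity, and specificity from made-up counts, assuming the layout that statsmodels documents for `pred_table()`: rows are actual values, columns are predicted values, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Made-up counts laid out as [[TN, FP], [FN, TP]]
conf_matrix = [[141, 59],
               [111, 89]]

(tn, fp), (fn, tp) = conf_matrix

# Proportion of all predictions that were correct
accuracy = (tn + tp) / (tn + fp + fn + tp)
# Proportion of actual positives that were predicted positive
sensitivity = tp / (tp + fn)
# Proportion of actual negatives that were predicted negative
specificity = tn / (tn + fp)

print(accuracy, sensitivity, specificity)
```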

# Sampling in Python

## Population vs. sample
**The population is the complete dataset**
- Doesn't have to refer to people
- Typically, we don't know what the whole population is


**The sample is the subset of data you calculate on**


## Population parameters & point estimates
1. A population parameter is a calculation made on the population dataset.
2. A point estimate or sample statistic is a calculation made on the sample dataset.
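
The distinction can be sketched with the standard library alone. The population below is simulated (made-up scores, not the coffee data), so both the parameter and the estimate can be computed and compared.

```python
import random
import statistics

random.seed(2021)

# Made-up population of 1000 scores
population = [random.gauss(82, 3) for _ in range(1000)]

# Population parameter: calculated on the whole population
population_mean = statistics.mean(population)

# Point estimate: the same calculation on a simple random sample
sample = random.sample(population, k=10)
point_estimate = statistics.mean(sample)

print(population_mean, point_estimate)
```

The two numbers differ because the sample only sees ten of the thousand values; a different seed gives a different point estimate but the same population parameter.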

In [None]:
# Population parameter
import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])

# Point estimate
cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)
np.mean(cup_points_samp)

In [None]:
# Visualizing selection bias
import matplotlib.pyplot as plt
import numpy as np

# Population distribution
coffee_ratings["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()

# Sample distribution
coffee_ratings_first10["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()

In [None]:
# Pseudo-random number generation 
randoms = np.random.beta(a=2, b=2, size=5000)
plt.hist(randoms, bins=np.arange(0, 1, 0.05))
plt.show()

# Random number seeds
np.random.seed(20000229)
print(np.random.normal(loc=2, scale=1.5, size=2))
print(np.random.normal(loc=2, scale=1.5, size=2))
print(np.random.normal(loc=2, scale=1.5, size=2))

## Simple random and systematic sampling

In [None]:
# Simple random sampling 
coffee_ratings.sample(n=5, random_state=19000113)

# Systematic sampling 
sample_size = 5
pop_size = len(coffee_ratings)

# Defining the interval
interval = pop_size // sample_size

# Selecting the rows
coffee_ratings.iloc[::interval]
'''
OBS: Systematic sampling is only safe if a scatter plot of the
variable of interest against the row index shows no pattern
'''
# Making systematic sampling safe
shuffled = coffee_ratings.sample(frac=1)
shuffled = shuffled.reset_index(drop=True).reset_index()
shuffled.plot(x="index", y="aftertaste", kind="scatter")
'''
OBS: Shuffling rows + systematic sampling
is the same as simple random sampling
'''
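
The interval logic can be checked on a plain list. This sketch uses a stand-in population of row labels 0 to 99 (made up for illustration) and Python slicing in place of `.iloc`.

```python
import random

population = list(range(100))  # stand-in for row labels 0..99
sample_size = 5

# Take every `interval`-th row, starting from the first
interval = len(population) // sample_size
systematic = population[::interval]

# Shuffling first removes any ordering pattern in the rows
rng = random.Random(2021)
shuffled = population[:]
rng.shuffle(shuffled)
shuffled_systematic = shuffled[::interval]

print(systematic)
print(shuffled_systematic)
```

One caveat: when the population size is not an exact multiple of the sample size, the slice can return more rows than `sample_size`.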

## Stratified and weighted random sampling

In [None]:
# Counts of a simple random sample
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)
coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)

# Proportional stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin").sample(frac=0.1, random_state=2021)
coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)

# Equal counts stratified sampling
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin").sample(n=15, random_state=2021)
coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)

# Weighted random sampling
# Specify weights to adjust the relative probability of a row being sampled
import numpy as np

coffee_ratings_weight = coffee_ratings_top
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"
coffee_ratings_weight['weight'] = np.where(condition, 2, 1)
coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")

## Cluster sampling

### Stratified sampling vs. cluster sampling
**Stratified sampling:**
- Split the population into subgroups.
- Use simple random sampling on every subgroup.
**Cluster sampling:**
- Use simple random sampling to pick some subgroups.
- Use simple random sampling on only those subgroups.
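
The difference between the two schemes can be sketched with the standard library. The subgroups below are made up (four groups of twenty ids each); only the sampling logic matters.

```python
import random

rng = random.Random(2021)

# Made-up population: subgroup name -> list of member ids
groups = {
    "A": list(range(0, 20)),
    "B": list(range(20, 40)),
    "C": list(range(40, 60)),
    "D": list(range(60, 80)),
}

# Stratified sampling: simple random sampling within every subgroup
stratified = {name: rng.sample(members, k=3) for name, members in groups.items()}

# Cluster sampling, stage 1: pick some subgroups at random
chosen = rng.sample(sorted(groups), k=2)
# Cluster sampling, stage 2: simple random sampling within only those subgroups
cluster = {name: rng.sample(groups[name], k=3) for name in chosen}

print(stratified)
print(cluster)
```

Stratified sampling guarantees that every subgroup is represented, while cluster sampling trades that guarantee for the lower cost of visiting fewer subgroups.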

In [None]:
# Stage 1: sampling for subgroups
import random
varieties_samp = random.sample(varieties_pop, k=3)

# Stage 2: sampling each group
variety_condition = coffee_ratings['variety'].isin(varieties_samp)
coffee_ratings_cluster = coffee_ratings[variety_condition]

coffee_ratings_cluster['variety'] = coffee_ratings_cluster['variety'].cat.remove_unused_categories()

coffee_ratings_cluster.groupby("variety").sample(n=5, random_state=2021)

### Multistage sampling
- Cluster sampling is a type of multistage sampling
- Can have > 2 stages
- E.g., countrywide surveys may sample states, counties, cities, and neighborhoods


## Sampling Distributions

### Relative error of point estimates

Properties:
- Really noisy, particularly for small samples
- Amplitude is initially steep, then flattens
- Relative error decreases to zero (when the sample size = population size)
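
These properties can be observed in a small simulation. The population here is made up (500 random integer scores), and `relative_error_pct` is a hypothetical helper written for this sketch.

```python
import random
import statistics

rng = random.Random(2021)

# Made-up population of integer scores
population = [rng.randint(60, 95) for _ in range(500)]
population_mean = statistics.mean(population)


def relative_error_pct(n):
    """Relative error (in percent) of the mean of one random sample of size n."""
    sample_mean = statistics.mean(rng.sample(population, k=n))
    return 100 * abs(population_mean - sample_mean) / population_mean


# The last sample size equals the population size
rel_errors = {n: relative_error_pct(n) for n in (5, 50, 250, len(population))}
print(rel_errors)
```

When the sample contains every row, the sample mean equals the population mean and the relative error is zero; for smaller samples the value changes from run to run.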

In [None]:
# Population parameter:
population_mean = coffee_ratings['total_cup_points'].mean()

# Point estimate
sample_mean = coffee_ratings.sample(n=sample_size)['total_cup_points'].mean()

# Relative error as a percentage:
rel_error_pct = 100 * abs(population_mean - sample_mean) / population_mean

### Creating a sampling distribution

In [None]:
mean_cup_points_1000 = []

for i in range(1000):
    mean_cup_points_1000.append(
        coffee_ratings.sample(n=30)['total_cup_points'].mean()
    )

print(mean_cup_points_1000)

### Approximate sampling distributions

In [None]:
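# Create a DataFrame and calculate the mean
# (expand_grid is a helper, not defined in these notes, assumed to
# return a DataFrame with every combination of the inputs)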
dice = expand_grid(  {'die1': [1, 2, 3, 4, 5, 6],
                      'die2': [1, 2, 3, 4, 5, 6],
                      'die3': [1, 2, 3, 4, 5, 6],
                      'die4': [1, 2, 3, 4, 5, 6]
                     }
                  )
dice['mean_roll'] = (dice['die1'] +
                     dice['die2'] +
                     dice['die3'] +
                     dice['die4']) / 4
print(dice)

In [None]:
# Exact sampling distribution
dice['mean_roll'] = dice['mean_roll'].astype('category')
dice['mean_roll'].value_counts(sort=False).plot(kind="bar")

In [None]:
# The number of outcomes increases fast
n_dice = list(range(1, 101))
n_outcomes = []
for n in n_dice:
    n_outcomes.append(6**n)

# Build the DataFrame once, after the loop, so both columns have equal length
outcomes = pd.DataFrame(
    {"n_dice": n_dice,
     "n_outcomes": n_outcomes})

# Plot the results
outcomes.plot(x="n_dice",
              y="n_outcomes",
              kind="scatter")
plt.show()

In [None]:
# Simulating the mean of four dice rolls
sample_means_1000 = []
for i in range(1000):
    sample_means_1000.append(
        np.random.choice(list(range(1, 7)), size=4, replace=True).mean()
    )

print(sample_means_1000)

# Approximate sampling distribution
plt.hist(sample_means_1000, bins=20)
plt.show()

## Standard errors and the Central Limit Theorem

### Consequences of the central limit theorem

**Averages of independent samples have approximately normal distributions.**

As the sample size increases,
- The distribution of the averages gets _closer to being normally distributed_
- The width of the sampling distribution gets _narrower_
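
The narrowing can be sketched with simulated die rolls and the standard library. The helper `sample_means` and the choice of 2000 replicates are assumptions made for this example.

```python
import random
import statistics

rng = random.Random(2021)


def sample_means(sample_size, n_replicates=2000):
    """Means of repeated samples of fair six-sided die rolls."""
    return [
        statistics.mean(rng.choices(range(1, 7), k=sample_size))
        for _ in range(n_replicates)
    ]


# Standard deviation of each sampling distribution
spread_small = statistics.stdev(sample_means(4))
spread_large = statistics.stdev(sample_means(64))

print(spread_small, spread_large)
```

With 16 times as many rolls per sample, the spread of the sample means should shrink by roughly a factor of four, since it scales with one over the square root of the sample size.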


In [None]:
# Population & sampling distribution standard deviations
coffee_ratings['total_cup_points'].std(ddof=0)
'''
- Specify ddof=0 when calling .std() on populations
- Specify ddof=1 when calling np.std() on samples or sampling distributions
'''

### Bootstrapping
**The opposite of sampling from a population**

- Sampling: going from a population to a smaller sample
- Bootstrapping: building up a theoretical population from the sample


OBS: Bootstrapping use case:
- Develop understanding of sampling variability using a single sample

### Bootstrapping process
1. Make a resample of the same size as the original sample
2. Calculate the statistic of interest for this bootstrap sample 
3. Repeat steps 1 and 2 many times


**The resulting statistics are bootstrap statistics, and they form a bootstrap distribution**
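
The three steps can be sketched without pandas. The sample below is simulated (made-up flavor scores), and `random.choices` stands in for `.sample(frac=1, replace=True)`.

```python
import random
import statistics

rng = random.Random(2021)

# Made-up original sample of 50 flavor scores
sample = [rng.gauss(7.5, 0.4) for _ in range(50)]

bootstrap_means = []
for _ in range(1000):  # Step 3: repeat many times
    # Step 1: resample with replacement, same size as the original sample
    resample = rng.choices(sample, k=len(sample))
    # Step 2: calculate the statistic of interest for this resample
    bootstrap_means.append(statistics.mean(resample))

# The standard deviation of the bootstrap distribution estimates the standard error
std_error = statistics.stdev(bootstrap_means)
print(statistics.mean(bootstrap_means), std_error)
```

Sampling with replacement is what makes each resample different from the original sample even though both have the same size.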


In [None]:
# Bootstrapping coffee mean flavor
import numpy as np
mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )

### Interpreting the standard errors
- Estimated standard error → standard deviation of the bootstrap distribution for a sample statistic
- Population std_dev ≈ Std_Error × √Sample_size


### Confidence intervals
- "Values within one standard deviation of the mean" includes a large number of values from each of these distributions
- We'll define a related concept called a confidence interval


In [None]:
# Inverse cumulative distribution function
from scipy.stats import norm
norm.ppf(quantile, loc=0, scale=1)

# Standard error method for confidence interval
point_estimate = np.mean(coffee_boot_distn)

std_error = np.std(coffee_boot_distn, ddof=1)

from scipy.stats import norm
lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)
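
As a cross-check that needs no SciPy, the sketch below repeats the standard error method with `statistics.NormalDist` from the standard library and compares it with a quantile-based interval read directly off the distribution. The bootstrap distribution here is simulated (made-up values), not the coffee data.

```python
import random
import statistics
from statistics import NormalDist

rng = random.Random(2021)

# Made-up bootstrap distribution of 1000 sample means
boot_distn = [rng.gauss(7.5, 0.05) for _ in range(1000)]

point_estimate = statistics.mean(boot_distn)
std_error = statistics.stdev(boot_distn)

# Standard error method: inverse CDF of a normal distribution
dist = NormalDist(mu=point_estimate, sigma=std_error)
lower_se, upper_se = dist.inv_cdf(0.025), dist.inv_cdf(0.975)

# Quantile method: 39 cut points at 2.5% steps; take the first and last
cuts = statistics.quantiles(boot_distn, n=40)
lower_q, upper_q = cuts[0], cuts[-1]

print(lower_se, upper_se)
print(lower_q, upper_q)
```

The two intervals should be close when the bootstrap distribution is roughly normal; a large gap between them is a hint that the normal approximation is poor.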