# Data Science Ex 13 - Doing Data Science

10.05.2023, Lukas Kretschmar (lukas.kretschmar@ost.ch)

## Let's have some Fun doing Data Science

This notebook is a collection of exercises that you can solve with all the tools you've seen in this course.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

In [None]:
# Preventing memory leak warnings when running KMeans
import os
os.environ["OMP_NUM_THREADS"] = "1"

**Please note:** You may not know all the parameters you have to set, so you may need to look them up online.
I didn't say it's going to be easy, but that's how it is when you work with fresh, unknown data outside of this course.

Also, feel free to add additional cells to code.
The following exercises just provide one, but sometimes it's easier to split the code into multiple cells.

## Numpy

In this part, you start with the basics and use Numpy a bit.

### Random Number Generator

Generate 3 random numbers.

Create two RNGs and generate 5 normally distributed (N(0,1)) numbers with each RNG.
The numbers must be the same.

Generate 10 uniformly distributed integers between -10 and 10.
Round the values instead of truncating them.

Create a 4 x 4 matrix containing normally distributed (N(0,2)) values.
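A minimal sketch of the seeding and rounding ideas behind these tasks, assuming NumPy's `default_rng` generator API. The seed `42` and all variable names are arbitrary choices for illustration, not part of the exercise.

```python
import numpy as np

# Two generators created with the same (arbitrary) seed produce identical streams.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

normal_a = rng_a.normal(loc=0, scale=1, size=5)
normal_b = rng_b.normal(loc=0, scale=1, size=5)
print(np.array_equal(normal_a, normal_b))

# Uniform floats in [-10, 10), rounded to the nearest integer instead of truncated.
rounded = np.rint(rng_a.uniform(-10, 10, size=10)).astype(int)
print(rounded)
```

Passing a tuple as `size`, for example `size=(4, 4)`, returns a matrix instead of a flat array.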

#### Solution

In [None]:
# %load ./Ex13_Numpy_Random_Sol.py

### Boolean

Create an array with the numbers from 0 to 9.

Based on that array, create an array containing boolean values indicating if a value is below 5 or not.

Use np.all() to check that all your values are just single digit numbers.

Use np.any() to check if a value is bigger than 9.
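As a rough illustration of how comparisons, `np.all()` and `np.any()` fit together, here is a small sketch. The variable names are made up for this example.

```python
import numpy as np

values = np.arange(10)      # 0, 1, ..., 9
below_five = values < 5     # element-wise comparison yields a boolean array

print(below_five)
print(np.all(values < 10))  # True only if every element satisfies the condition
print(np.any(values > 9))   # True if at least one element satisfies the condition
```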

#### Solution

In [None]:
# %load Ex13_Numpy_Bool_Sol.py

### Statistics

Generate 1000 normally distributed (N(5,1)) values.

Show the maximum and minimum values of the generated numbers.

Show the mean, standard deviation and median of the values.

Show the boundaries of the quartiles (25%, 50% and 75%) for the given values.

Show the boundaries for the top 10% and top 1% of the values.
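One way to read the last two tasks is in terms of percentiles: the top 10% of the values start at the 90th percentile, the top 1% at the 99th. A small sketch, assuming `np.percentile` and an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
values = rng.normal(loc=5, scale=1, size=1000)  # N(5, 1)

# np.percentile expects percentages in the range [0, 100].
q25, q50, q75 = np.percentile(values, [25, 50, 75])

# Lower boundaries of the top 10% and the top 1% of the values.
top10, top1 = np.percentile(values, [90, 99])

print(q25, q50, q75)
print(top10, top1)
```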

#### Solution

In [None]:
# %load ./Ex13_Numpy_Stats_Sol.py

## Pandas

In this part, you'll use Pandas to load, fix and filter data.

### Dataset Information

Load the dataset **Ex13_Pandas_Info_Data.csv**.

Print the information (columns & data types) for the dataset.

Display a summary of all the data in the dataset.
Use the `include="all"` argument.

#### Solution

In [None]:
# %load ./Ex13_Pandas_Info_Sol.py

### Missing Data

Load the dataset **Ex13_Pandas_Missing_Data.csv**.

As you can see, you have a list of car specs and some values are missing.

Check the *percentage of missing values* and *absolute number of missing values* per column.
Show the result as a nice dataframe, not just the default print of an array.
And sort the result descending by the percentage.

Replace the missing `acceleration` values with a fixed value of `16`.
And check if all the values got replaced.

Fill in the `years` using the *forward fill* method and then the *backward fill* method to get the missing value in the first row.
And check that all the missing values got replaced.

Now, we assume that the `weight` of a car is correlated with its gas usage.
Thus, replace the missing `weights` with the *average weight* per group of vehicles that have the same `mpg` value.
And check if the fix worked.

As you can see, there are some `weights` that could not be filled.
For those, let's just set the `weight` to the *median* `weight` of all the cars in the dataset.

And for the missing `displacement` values, use the *mean* of the `cylinders` and `horsepower` combination.
And if there are still some values missing, replace them just by the *mean* of all the cars with the same number of `cylinders`.

Show that there are no values missing in your dataset.
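The group-wise filling pattern described above can be sketched on a tiny, made-up dataframe. The values below are hypothetical and the real file has more columns and rows; only the column names `mpg`, `weight` and `acceleration` are taken from the exercise text.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset, for illustration only.
cars = pd.DataFrame({
    "mpg": [18.0, 18.0, 18.0, 25.0, 25.0, 30.0],
    "weight": [3500.0, np.nan, 3300.0, 2500.0, np.nan, np.nan],
    "acceleration": [12.0, np.nan, 11.0, 16.5, 15.0, np.nan],
})

# Missing values per column, absolute and as a percentage, sorted descending.
missing = pd.DataFrame({
    "missing": cars.isna().sum(),
    "percent": cars.isna().mean() * 100,
}).sort_values("percent", ascending=False)
print(missing)

# Constant replacement.
cars["acceleration"] = cars["acceleration"].fillna(16)

# Group-wise mean: transform returns a Series aligned with the original index.
group_mean = cars.groupby("mpg")["weight"].transform("mean")
cars["weight"] = cars["weight"].fillna(group_mean)

# Groups without any known weight stay NaN, so fall back to the overall median.
cars["weight"] = cars["weight"].fillna(cars["weight"].median())
print(cars.isna().sum())
```

Using `transform` rather than `agg` keeps the result the same length as the dataframe, which is what lets it be passed straight to `fillna`.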

#### Solution

In [None]:
# %load ./Ex13_Pandas_Missing_Sol.py

### Binning

Load the dataset of **Ex13_Pandas_Binning_Data.csv**.

Create 4 equi-width bins based on the values in column `Education`.

Create 5 bins with the data in the `Age` column using the following boundaries `[0, 20, 40, 60, 80, 100]`.
Show the resulting bins in the expected order (not ordered by the entries per bin, but the boundaries).

Create 6 equi-depth bins based on the values in column `Age`.

Create bins with the following quantiles `[0,.1,.33,.5,.75,.9,.99,1]` on the `Income` column.
Show the resulting bins in the expected order.
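A short sketch of the difference between equi-width and equi-depth binning, assuming `pd.cut` and `pd.qcut`. The ages are synthetic and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
ages = pd.Series(rng.integers(1, 100, size=200), name="Age")  # synthetic ages 1..99

# Equi-width: an integer `bins` splits the value range into equally wide intervals.
equi_width = pd.cut(ages, bins=4)

# Explicit boundaries: the intervals are (0, 20], (20, 40], and so on.
fixed = pd.cut(ages, bins=[0, 20, 40, 60, 80, 100])

# Equi-depth: qcut puts roughly the same number of entries into each bin.
equi_depth = pd.qcut(ages, q=6)

# value_counts sorts by frequency; sort_index restores the interval order.
counts = fixed.value_counts().sort_index()
print(counts)
print(equi_depth.value_counts().sort_index())
```

`pd.qcut` also accepts a list of quantiles for `q` instead of a bin count.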

#### Solution

In [None]:
# %load ./Ex13_Pandas_Binning_Sol.py

### Sampling

Load the dataset from **Ex13_Pandas_Sampling_Data.csv**.

Show the number of entries per category found in column `Ethnicity` within a bar chart.

You see that one category is considerably more frequent than the other two.
Thus, let's bring them to the same level.

Select *150* samples of values with ethnicity `Caucasian`.

Select 150 samples of values with ethnicity `Asian`.

Select 150 samples of values with ethnicity `African American`.

Combine the three samples from above into one new dataset and show within a bar chart that the imbalance is gone.
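The downsampling idea can be sketched as follows. The dataframe is a synthetic stand-in with made-up category sizes, and the `Value` column is hypothetical.

```python
import pandas as pd

# Synthetic, imbalanced stand-in for the real dataset.
df = pd.DataFrame({
    "Ethnicity": ["Caucasian"] * 400 + ["Asian"] * 200 + ["African American"] * 180,
    "Value": range(780),
})

# Draw the same number of rows from every category; random_state makes it repeatable.
samples = [
    df[df["Ethnicity"] == category].sample(n=150, random_state=0)
    for category in ["Caucasian", "Asian", "African American"]
]
balanced = pd.concat(samples, ignore_index=True)

print(balanced["Ethnicity"].value_counts())
```

Calling `.plot.bar()` on the result of `value_counts()` is one way to turn these counts into a bar chart.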

#### Solution

In [None]:
# %load ./Ex13_Pandas_Sampling_Sol.py

## Data Visualization

Well, now let's have some fun with data visualization.

Load the dataset **Ex13_DataViz_Data.csv**.
The dataset contains avocado pricing and sales information.

Show a list of all the available regions.
Now you know what the names should look like in the following exercises.

#### Solution

In [None]:
# %load ./Ex13_DataViz_Init_Sol.py

### Line Plot

Show the average price over time in *Boston*, *Chicago*, *NewYork*, *LosAngeles*, and *SanFrancisco* in a line plot.
Use the cities as labels for the lines.
And make sure that the values are in order.

#### Solution

In [None]:
# %load ./Ex13_DataViz_Line_Sol.py

### Styling Lines

Show the total volume sold over time in the *West*, *Southeast*, and *Northeast*.
Style the lines as follows:
- West: Blue and dashed
- Northeast: Green and dash dot
- Southeast: Red and dotted
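For reference, here is a sketch of the two common ways to style a line in Matplotlib: keyword arguments and a format string. The plotted numbers are placeholders, not avocado data.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)  # placeholder for the dates
fig, ax = plt.subplots()

# Color and line style can be given as keyword arguments ...
ax.plot(x, x + 3, color="blue", linestyle="--", label="West")
ax.plot(x, x + 2, color="green", linestyle="-.", label="Northeast")
# ... or combined in a format string: "r:" means red and dotted.
ax.plot(x, x + 1, "r:", label="Southeast")
ax.legend()
```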

#### Solution

In [None]:
# %load ./Ex13_DataViz_Styling_Sol.py

### Bar Chart

Show the total bags sold per year for the whole USA (*TotalUS*) in a bar chart.

#### Solution

In [None]:
# %load ./Ex13_DataViz_Bar_Sol.py

### Stacked Bar Chart

Show the total volume sold per year and region in a horizontal bar chart.
The bars should be stacked to the right of each other, and ignore the volume for the whole US since we look at the regions individually.

**Bonus**: Try to sort the regions by their total over all years
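A possible shape for this kind of chart, sketched on made-up numbers: pivot the data so that regions become rows and years become columns, then let pandas stack the bars. The column names `region`, `year` and `volume` are assumptions for this example and may differ from the real dataset.

```python
import pandas as pd

# Made-up volumes, for illustration only.
sales = pd.DataFrame({
    "region": ["West", "West", "Northeast", "Northeast", "TotalUS", "TotalUS"],
    "year": [2015, 2016, 2015, 2016, 2015, 2016],
    "volume": [30.0, 35.0, 20.0, 22.0, 50.0, 57.0],
})

# Drop the country-wide aggregate, then sum the volume per region and year.
regions_only = sales[sales["region"] != "TotalUS"]
per_year = regions_only.pivot_table(index="region", columns="year",
                                    values="volume", aggfunc="sum")

# Order the rows by their total volume over all years.
order = per_year.sum(axis=1).sort_values().index
ax = per_year.loc[order].plot.barh(stacked=True)
```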

#### Solution

In [None]:
# %load ./Ex13_DataViz_BarH_Sol.py

### Pie Chart

Draw pie charts with the total volume sales per region (one chart per region).
The regions should be *California*, *Portland*, and *Houston*.
And you should show the percentage per year.

**Bonus**: Highlight per pie the largest piece.

#### Solution

In [None]:
# %load ./Ex13_DataViz_Pie_Sol.py

### Scatter Plot

Plot a comparison of the total bags and total volume per year & region sold in a scatter plot.
Use the year for coloring, and the average price for the size.
Ignore *TotalUS* as region.

#### Solution

In [None]:
# %load ./Ex13_DataViz_Scatter_Sol.py

### Area Plot

Create an area plot showing the volume per avocado PLU (Price Look Up) *4046*, *4225*, and *4770* over time.
Again, ignore the *TotalUS* numbers.

#### Solution

In [None]:
# %load ./Ex13_DataViz_Area_Sol.py

### Layouting

Let's show some visualizations in a grid rather than showing them alone.
For this exercise, we concentrate on the following regions: *Chicago*, *Detroit*, *Philadelphia*, *San Diego*, and *Seattle*.
Create the following visualizations in a 2x3 grid:
- **\[Top Left\]:** Bar chart with small bags, large bags, and xlarge bags per region of 2017.
- **\[Top Center\]:** Pie chart showing the total volume sold in 2016 per region and highlight the largest piece.
- **\[Top Right\]:** Line plot with the average price per region over time.
- **\[Bottom Left\]:** Horizontal bar chart showing the volumes sold per PLU per region per year. PLUs should be shown beside each other but the yearly values should be stacked. **$\Longleftarrow$ This one is not easy**
- **\[Bottom Center\]:** Area plot with the total sales over time per region.
- **\[Bottom Right\]:** Scatter plot comparing the total volume sold of avocado with PLU 4046 and 4225 per month in 2015. Use the region for coloring and total volume for sizing. **$\Longleftarrow$ This one is a challenge**

#### Solution

In [None]:
# %load ./Ex13_DataViz_Layout_Sol.py

## Association Analysis

Here, you'll do a market basket analysis.

### Preprocessing

If you want to try to process the raw data, have a try.
Otherwise, you can just go on with the analysis down below.
You'll find the market basket data in **Ex13_MBA_Data_Raw.csv**.
At the end, you should have a dataframe containing just `false` and `true`, and the columns should be the names of the products in the basket.

#### Solution

In [None]:
# %load ./Ex13_MBA_Preprocessing_Sol.py

### Analysis

Now, let's create the frequent item sets (*support* >= 5%) and rules.
At the end, just show the rules with a positive association, i.e. a *lift* greater than 1.

Feel free to dig deeper into the data.
You can either use your processed data from the exercise above, or start with the provided dataset found in **Ex13_MBA_Data.csv**.
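To make *support* and *lift* concrete, the following sketch computes both by hand with pandas on a tiny, hypothetical basket table, rather than with a dedicated association-rule library. The helper functions `support` and `lift` are written for this example only.

```python
import pandas as pd

# Tiny hypothetical one-hot basket table: one row per transaction.
baskets = pd.DataFrame({
    "bread": [True, True, True, False, True],
    "butter": [True, True, False, False, True],
    "milk": [False, True, True, True, False],
})

def support(items):
    """Share of transactions that contain all given items."""
    return baskets[list(items)].all(axis=1).mean()

def lift(antecedent, consequent):
    """support(A and C) / (support(A) * support(C))."""
    joint = support([antecedent, consequent])
    return joint / (support([antecedent]) * support([consequent]))

print(lift("bread", "butter"))  # above 1: bought together more often than by chance
print(lift("butter", "milk"))   # below 1: bought together less often than by chance
```

A lift of exactly 1 would mean that the two items appear together as often as expected if they were independent.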

#### Solution

In [None]:
# %load ./Ex13_MBA_Analysis_Sol.py

## Classification

Recap of the classification approaches of this course.

### Naïve Bayes

Create a model predicting whether a movie review is positive or negative.
As data, use the reviews found in **Ex13_Class_NB_Data.csv**.

- Show your model's performance using cross validation. (*Note:* We use just the raw data, without changing the model's hyperparameters. Thus, the scores might not be that good.)
- Show the confusion matrix for your model.
- Test your model with 4 reviews. Just write them and use them against your model.
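The three steps above could look roughly like this, assuming scikit-learn with a `CountVectorizer` feeding a `MultinomialNB` classifier. The reviews and labels are made up, and the choice of vectorizer and classifier is an assumption, not necessarily the one used in the course solution.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews; 1 = positive, 0 = negative.
texts = [
    "great movie, loved it", "wonderful acting and a great story",
    "loved the wonderful soundtrack", "a great and moving film",
    "terrible movie, hated it", "boring plot and awful acting",
    "hated the boring ending", "an awful and terrible film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer turns text into word counts, the classifier works on those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Cross-validated accuracy and a confusion matrix from out-of-fold predictions.
scores = cross_val_score(model, texts, labels, cv=4)
matrix = confusion_matrix(labels, cross_val_predict(model, texts, labels, cv=4))
print(scores.mean())
print(matrix)

# Fit on all data and try the model on new, hand-written reviews.
model.fit(texts, labels)
predictions = model.predict(["what a great film", "awful and boring"])
print(predictions)
```

Wrapping both steps in one pipeline means the vocabulary is rebuilt inside every cross-validation fold, so no information leaks from the held-out reviews.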

#### Solution

In [None]:
# %load ./Ex13_Class_NB_Sol.py

### Random Forest

Let's have some fun with decision trees.
In dataset **Ex13_Class_RandomForst_Data.csv** you'll find details on 800 Pokémon.
Try to train the best *random forest classifier* you can to predict the primary type (*type1*) of a Pokémon based on its effectiveness against the other types (*against_\**).

- Show the accuracy score of your model
- Show the confusion matrix of your classifier
- Show 2 decision trees of your model
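A sketch of the overall workflow on synthetic data, assuming scikit-learn. The data comes from `make_classification` and only stands in for the real features; the hyperparameters are arbitrary and not tuned.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree

# Synthetic multi-class data standing in for the real features and labels.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X_train, y_train)

predicted = forest.predict(X_test)
matrix = confusion_matrix(y_test, predicted)
print(accuracy_score(y_test, predicted))
print(matrix)

# A fitted forest keeps its individual decision trees in `estimators_`.
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
for tree, ax in zip(forest.estimators_[:2], axes):
    plot_tree(tree, ax=ax, max_depth=2, filled=True)
```

Limiting `max_depth` in `plot_tree` only truncates the drawing; the underlying trees keep their full depth.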

#### Solution

In [None]:
# %load ./Ex13_Class_RandomForest_Sol.py

## Clustering

Let's work on some basic clustering.

### k-Means

Run a cluster analysis on the data in **Ex13_Clust_kMeans_Data.csv**.
The data contains a simple dataset of mall customers.
Show the clusters and their centers in a scatter plot to get an idea what customers could be grouped together.

- Use income and spending score for the plot.
- Visualize the age distribution of each cluster in histograms.
- Ignore the gender for training the model, but show it in a bar chart.
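The core of such an analysis might be sketched like this, assuming scikit-learn's `KMeans`. The points are synthetic blobs standing in for income and spending score, and `n_clusters=3` only fits this toy data; the real data needs its own choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic stand-in for income / spending score: three well separated blobs.
rng = np.random.default_rng(0)  # arbitrary seed
centers_true = np.array([[20, 20], [60, 50], [100, 80]])
X = np.vstack([c + rng.normal(scale=3, size=(50, 2)) for c in centers_true])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Color the points by cluster and mark the cluster centers on top.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
ax.scatter(*kmeans.cluster_centers_.T, c="red", marker="x", s=100, label="centers")
ax.set_xlabel("income (synthetic)")
ax.set_ylabel("spending score (synthetic)")
ax.legend()
```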

#### Solution

In [None]:
# %load ./Ex13_Clust_kMeans_Sol.py

### k-Medoids

Run a cluster analysis on the data found in file **Ex13_Clust_kMedoids_Data.csv**.
You'll find some credit card information, and how the users have used it.
Use the k-medoids algorithm, find a good number of clusters and visualize the data and cluster centers.

- Based on the data, you should get a good scatter plot when using the purchase frequency and balance.
- List the mean, standard deviation and median values of all clusters found.

#### Solution

In [None]:
# %load ./Ex13_Clust_kMedoids_Sol.py