# Data Science Ex 14 - Exercises

10.05.2023, Lukas Kretschmar (lukas.kretschmar@ost.ch)

## Let's have some Fun with Data Science!

In this exercise, you are going to have a look at what you can now do with Data Science.
Basically, you need to apply many of the approaches you've learned in this module.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

## Exercises

We do not provide any help other than the data plus some instructions and constraints for each exercise.
The rest is up to you.

Thus, follow these steps in every exercise:
- Load the data
- Preprocess the data (incl. checking the data)
- Create the model (you may have to import the model class first)
- Use the model
- Optimize the model
- Visualize the results

**Note:** Depending on the exercise, some steps may not be required.

**Hint:** It may be a good idea to add more code-cells.
We provide just one, but splitting the code into multiple cells makes it easier to work with.
And you do not overwrite work you have already done.

### Data Visualization

#### Ex01 - Simple Plots

In this exercise, you must create simple charts with matplotlib.

Plot `sin(x)` and `cos(x)` in one figure within the range of `(-6, 6)`.
`sin(x)` should be drawn in green and dotted, `cos(x)` in red and dashed.
And don't forget to show the labels in a legend.
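
A minimal sketch of one way to draw this figure. The number of sample points (200) is an arbitrary choice; the notebook's `%matplotlib inline` setting renders the figure at the end of the cell.

```python
import numpy as np
import matplotlib.pyplot as plt

# 200 evenly spaced x-values in the requested range
x = np.linspace(-6, 6, 200)

fig, ax = plt.subplots()
# ":" is the dotted line style, "--" the dashed one
sin_line, = ax.plot(x, np.sin(x), color="green", linestyle=":", label="sin(x)")
cos_line, = ax.plot(x, np.cos(x), color="red", linestyle="--", label="cos(x)")
ax.set_xlim(-6, 6)
ax.legend()  # picks up the label= arguments from the plot calls
```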

Plot a normal distribution (random numbers) as a histogram.

Plot the normal distribution (random numbers) as a line.
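
The two normal-distribution plots could be sketched like this. Reading "as a line" as "connect the centers of the histogram bins" is an assumption; a kernel density estimate would be another reasonable interpretation. Sample size, bin count, and seed are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

fig, (ax_hist, ax_line) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of the random numbers
ax_hist.hist(samples, bins=30)
ax_hist.set_title("Histogram")

# Same distribution as a line: bin the samples and connect the bin centers
density, edges = np.histogram(samples, bins=30, density=True)
centers = (edges[:-1] + edges[1:]) / 2
ax_line.plot(centers, density)
ax_line.set_title("Line")
```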

Plot a pie chart.
Each slice has 1/4 of the pie.
The chart should be tilted by 45° so every slice points in one direction (left, top, right, bottom).
The slice on the right should be highlighted.
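
One possible sketch of the pie chart. It relies on matplotlib drawing slices counterclockwise from `startangle`, so with a start at 45° the four slices end up at the top, left, bottom, and right in that order. The labels and the `explode` offset of 0.1 are arbitrary choices.

```python
import matplotlib.pyplot as plt

sizes = [1, 1, 1, 1]                          # four equal slices
labels = ["top", "left", "bottom", "right"]   # order follows the drawing direction
explode = [0, 0, 0, 0.1]                      # pull out only the slice on the right

fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, explode=explode, startangle=45)
ax.axis("equal")  # keep the pie circular
```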

Create a scatter plot with 40 points.
Use a normal distribution to generate coordinates.
The size of each point is the square of its x-value times 200.
The color of the point is the square of its y-value.
Use a bit of transparency to see overlapping points.
And points should be shown as triangles.
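
A minimal sketch of the scatter plot. The seed, the transparency value of 0.5, and the added color bar are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = rng.normal(size=40)

fig, ax = plt.subplots()
points = ax.scatter(
    x, y,
    s=x**2 * 200,   # size: square of the x-value times 200
    c=y**2,         # color: square of the y-value, mapped through the colormap
    alpha=0.5,      # transparency so overlapping points stay visible
    marker="^",     # triangles
)
fig.colorbar(points, ax=ax, label="y squared")
```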

##### Solution

In [None]:
# %load ./Ex14_Plot-01_Sol.py

#### Ex02 - Technology Index Plots

In this exercise, you must create plots based on given data.
The dataset contains average prices for certain products.

Load the following files:
- **Ex14_Plot-02_Data-Prices.csv** contains the main dataset
- **Ex14_Plot-02_Data-Countries.csv** contains the continent a country belongs to
- **Ex14_Plot-02_Data-Population.csv** contains data about each country's population

You may have to preprocess the prices first.
Combine the three datasets into one, so you have the continent and population of each country in one dataset with the pricing data.
Then take this dataset and create the following plots in a 2x2 grid.
- **Top left**: Horizontal bar chart of the top 5 countries with the most expensive iPhone prices.
And show the Android prices for those countries as well.
Limit the x-axis to \$8000 (you will see why).
- **Top right**: Pie chart showing how many countries have higher prices for MacBooks compared to Windows machines.
Use the following segments `> 200%, 100% - 200%, 50% - 100%, 25% - 50%, 0% - 25%, < 0%`.
Highlight the smallest and biggest segment.
- **Bottom left**: Scatter plot comparing prices for an XBox One and PS4.
Use the population for the size of each point (you may have to reduce the value) and the continent for coloring.
Since Venezuela is again off the charts, limit the plot.
And highlight Switzerland in the plot.
- **Bottom right**: Histogram of prices of Smart TVs, Headphones and 2TB HDDs.
You may have to increase the number of bins.
And ignore Venezuela, again.
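
The combining step and the segment counts for the pie chart could be sketched as follows. Everything about the data here is hypothetical: the column names, the price format (text with `$` and thousands separators), and the made-up rows stand in for the real CSV files, which may look different. Interpreting the segments as the relative MacBook premium over Windows machines is also an assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the three CSV files; real column names may differ
prices = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "MacBook": ["$1,600.00", "$3,200.00", "$900.00", "$1,300.00"],
    "Windows": ["$1,000.00", "$800.00", "$1,000.00", "$1,000.00"],
})
countries = pd.DataFrame({"Country": ["A", "B", "C", "D"],
                          "Continent": ["Europe", "Asia", "Europe", "Africa"]})
population = pd.DataFrame({"Country": ["A", "B", "C"], "Population": [5e6, 9e7, 2e7]})

# Assumed preprocessing: strip currency symbols and thousands separators
for col in ["MacBook", "Windows"]:
    prices[col] = prices[col].str.replace(r"[$,]", "", regex=True).astype(float)

# Left joins keep every row of the price data, even without a match
combined = (prices
            .merge(countries, on="Country", how="left")
            .merge(population, on="Country", how="left"))

# Relative MacBook premium in percent, sorted into the requested segments
premium = (combined["MacBook"] / combined["Windows"] - 1) * 100
bins = [-np.inf, 0, 25, 50, 100, 200, np.inf]
labels = ["< 0%", "0% - 25%", "25% - 50%", "50% - 100%", "100% - 200%", "> 200%"]
segments = pd.cut(premium, bins=bins, labels=labels, right=False)
counts = segments.value_counts().reindex(labels)

# 2x2 grid; only the top-right pie chart is filled in here
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
nonzero = counts[counts > 0]
axes[0, 1].pie(nonzero, labels=list(nonzero.index))
```

The left joins make missing continent or population entries show up as `NaN` instead of silently dropping countries, which helps when checking the data.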

##### Solution

In [None]:
# %load ./Ex14_Plot-02_Sol.py

### Classification

#### Ex01 - Ford Review Classifier

In this exercise, you should create and train a classifier (Naïve Bayes) to predict if a car review is about a Ford or another car manufacturer.
First, train your classifier with the content of **Ex14_Class-01_Data-Train.csv** and visualize/show its accuracy.
Then predict the labels for the content in **Ex14_Class-01_Data-Holdout.csv**.
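
A rough sketch of such a classifier, assuming scikit-learn is available. The toy reviews, the column names `text` and `is_ford`, and the choice of `TfidfVectorizer` with `MultinomialNB` are all assumptions; the real CSV files may be structured differently.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy reviews; the real data comes from the CSV files
train = pd.DataFrame({
    "text": [
        "my mustang has a great engine",
        "the f150 truck hauls everything",
        "love the focus for city driving",
        "the civic is reliable and cheap",
        "my corolla never breaks down",
        "the golf handles well on curves",
    ],
    "is_ford": [1, 1, 1, 0, 0, 0],
})

# TF-IDF word features feeding a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train["text"], train["is_ford"])

# Accuracy on the training data only; a held-back split gives a fairer estimate
train_accuracy = model.score(train["text"], train["is_ford"])

# Predict labels for unseen reviews (stand-in for the holdout file)
holdout = pd.Series(["the mustang engine sounds great", "my civic is cheap to run"])
predictions = model.predict(holdout)
```

Putting the vectorizer and the classifier into one pipeline ensures that the holdout texts are transformed with the vocabulary learned from the training data.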

##### Solution

In [None]:
# %load ./Ex14_Class-01_Sol.py

#### Ex02 - Telecom Churn Classifier

In this exercise, you should create, train and optimize a classifier (RandomForest) to estimate the likelihood that a customer will change their current telecom provider.
Use **Ex14_Class-02_Data-Train.csv** to prepare your model.
The column `Churn` contains the information whether a customer will change or not.
And then predict the likelihoods for the data in **Ex14_Class-02_Data-Holdout.csv**.

Before you can start with the training, the data has to be preprocessed since some columns have categorical values or contain text instead of numbers.
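
The preprocessing and the likelihood prediction might look roughly like this. Apart from `Churn`, all column names and values are invented for illustration, and the parameter grid is only an example of what could be tuned.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Made-up toy data; apart from `Churn`, all column names are hypothetical
train = pd.DataFrame({
    "Contract": ["monthly", "yearly", "monthly", "yearly"] * 5,
    "MonthlyCharges": ["70.5", "20.0", "85.1", "25.3"] * 5,  # numbers stored as text
    "Churn": ["Yes", "No", "Yes", "No"] * 5,
})

# Convert text to numbers and one-hot encode the categorical column
train["MonthlyCharges"] = pd.to_numeric(train["MonthlyCharges"], errors="coerce")
X = pd.get_dummies(train.drop(columns="Churn"), columns=["Contract"])
y = (train["Churn"] == "Yes").astype(int)

# Small grid search as one way to optimize the forest
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [2, None]},
    cv=4,
)
grid.fit(X, y)

# Holdout rows must end up with the same columns as the training data
holdout = pd.DataFrame({"Contract": ["monthly"], "MonthlyCharges": ["90.0"]})
holdout["MonthlyCharges"] = pd.to_numeric(holdout["MonthlyCharges"], errors="coerce")
X_holdout = pd.get_dummies(holdout, columns=["Contract"]).reindex(
    columns=X.columns, fill_value=0
)

# Column 1 of predict_proba is the probability of the positive class (churn)
churn_probability = grid.predict_proba(X_holdout)[:, 1]
```

The `reindex` call matters because one-hot encoding the holdout data on its own can produce fewer columns than the training data had.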

##### Solution

In [None]:
# %load ./Ex14_Class-02_Sol.py

### Clustering

#### Ex01 - NYC Wi-Fi Clustering

In this exercise, you should find clusters of Wi-Fi hotspots in New York City.
Load the data from **Ex14_Clust-01_Data.csv** and visualize the Wi-Fi hotspots.
Go on by selecting an appropriate clustering algorithm and find clusters in the data.
You may have to change some hyperparameters.
For the sake of simplicity, work only with `longitude` and `latitude` to find location-based clusters.
Use `euclidean` and `manhattan` (obviously) distances for clustering and show both plots side by side.
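
A sketch of the side-by-side comparison on synthetic coordinates. Choosing `DBSCAN` is an assumption (the exercise leaves the algorithm open); it is used here because it accepts both distance metrics directly. The values for `eps` and `min_samples` are guesses that would need tuning on the real data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the hotspot coordinates: two dense groups plus noise
rng = np.random.default_rng(1)
coords = np.vstack([
    rng.normal([-73.98, 40.75], 0.005, size=(100, 2)),
    rng.normal([-73.95, 40.70], 0.005, size=(100, 2)),
    rng.uniform([-74.05, 40.65], [-73.90, 40.80], size=(20, 2)),
])

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
labels_by_metric = {}
for ax, metric in zip(axes, ["euclidean", "manhattan"]):
    # eps and min_samples are guesses that need tuning on the real data
    labels = DBSCAN(eps=0.005, min_samples=5, metric=metric).fit_predict(coords)
    labels_by_metric[metric] = labels
    # Points labeled -1 are treated as noise by DBSCAN
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=10)
    ax.set_title(f"DBSCAN ({metric})")
    ax.set_xlabel("longitude")
axes[0].set_ylabel("latitude")
```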

##### Solution

In [None]:
# %load ./Ex14_Clust-01_Sol.py