# Lecture 1: Getting to Know Your Data

Author: Sebastian Torres-Lara

## Overview

Congrats, you have been hired as a data scientist at a winery! Your employer wants you to go over a dataset containing chemical information from over a thousand different wines.
Your goal is to explore the dataset to understand the impact of each messaur

# Tools of the Trade

Data science is constantly evolving, whether it's a fancy new machine learning algorithm or an update to ChatGPT.
My point is there are a lot of cool libraries, code editors, algorithm that
As a data scientist you will have a wide array of tools at your disposal.

For this lecture series we will be using Jupyter Notebooks to write and run our code.

# 1. Getting your Data

To start, we need to get our hands on some data! Luckily, there are a ton of public data repos out there.

For this case we will be using the

*Side Note*:

Download the red wine quality dataset (run the cell below) from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).
- This dataset contains information about red variants of Vinho Verde (literally "green wine"), a wine from northern Portugal

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

Check that your wine dataset has been downloaded. It is best to keep it in the same directory as this notebook.

In [None]:
!ls

### Quick Tangent: Terminal Commands
In case you're wondering, the `!` lets you run terminal commands from a Jupyter cell.
Knowing terminal commands is not a must, but they'll make you a more efficient coder:
1. `ls` (list) - displays a list of files and directories in the current directory.
    * Example: `ls`

2. `cd` (change directory) - changes the current working directory to the specified directory.
    * Example: `cd /home/user/Documents`

3. `mkdir` (make directory) - creates a new directory with the specified name.
    * Example: `mkdir new_directory`

4. `rm` (remove) - deletes a file or directory.
    * Example: `rm file.txt` or `rm -r directory`

5. `cp` (copy) - copies a file or directory to a new location.
    * Example: `cp file.txt /home/user/Documents`

6. `mv` (move) - moves a file or directory to a new location or renames it.
    * Example: `mv file.txt new_location/file.txt` or `mv file.txt new_name.txt`

7. `touch` - creates a new empty file with the specified name.
    * Example: `touch new_file.txt`

8. `cat` (concatenate) - displays the contents of a file.
    * Example: `cat file.txt`

9. `grep` (global regular expression print) - searches for a specific pattern in a file or files.
    * Example: `grep "hello" file.txt`

10. `sudo` (superuser do) - executes a command with administrative privileges.
    * Example: `sudo apt-get update`


# 2. Understanding Your Data

Now that we have our dataset, we need to load it into our notebook.
To do this we will use the *pandas* library, a powerful data manipulation and analysis library.
- [Pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)

Import pandas

In [None]:
import pandas as pd

Load the data using `read_csv` and save it as `df`
- Loads the dataset into the notebook as a [DataFrame](https://pandas.pydata.org/docs/reference/frame.html) object
- This file separates its values with semicolons instead of commas, so we pass `delimiter=';'` to tell pandas which character (or sequence of characters) splits the values

In [None]:
df = pd.read_csv('winequality-red.csv', delimiter=';')
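To see why `delimiter` matters, here is a small sketch that reads semicolon-separated text from an in-memory string instead of a file. The two-row sample and the names `sample_csv`, `wrong`, and `right` are made up for illustration and are not part of the wine dataset.

```python
import io

import pandas as pd

# A tiny, made-up sample in the same semicolon-separated layout as the wine file.
sample_csv = "pH;alcohol;quality\n3.51;9.4;5\n3.20;9.8;6\n"

# With the default comma delimiter pandas finds no commas,
# so every row ends up in one single column.
wrong = pd.read_csv(io.StringIO(sample_csv))

# Passing delimiter=';' splits each row into three separate columns.
right = pd.read_csv(io.StringIO(sample_csv), delimiter=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

If `df.shape` ever reports a single column for a file you expected to be wide, a wrong delimiter is a likely cause.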

Get the shape of your dataframe using `df.shape`:
- The output looks like this: `(rows, columns)`
- You'll often hear/read the number of columns referred to as the number of features
- The number of rows is also referred to as the length of the dataset

In [None]:
df.shape

Use `df.info()` to get a quick summary of the dataframe.
This method prints each `Column` name, its `Non-Null Count` (the number of non-missing values), and its `Dtype`.

In [None]:
df.info()

View the first 10 rows using `df.head(10)`:

In [None]:
df.head(10)

View the last 10 rows using `df.tail(10)`:

In [None]:
df.tail(10)

View 10 random rows using `df.sample(10)`:

In [None]:
df.sample(10)

Check the data types of each column (col) using: `df.dtypes`

In [None]:
df.dtypes

Check whether there are any NaN values (NaN, "not a number", is the marker pandas uses for a missing value) and count them per column using `df.isna().sum()`:

In [None]:
df.isna().sum()
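The wine data may well contain no missing values, so here is a minimal sketch on a made-up table (`toy` and its numbers are hypothetical) that shows what the counts look like when some cells are missing.

```python
import numpy as np
import pandas as pd

# Made-up measurements with two missing cells, marked with np.nan.
toy = pd.DataFrame({
    "pH": [3.5, np.nan, 3.2],
    "alcohol": [9.4, 9.8, np.nan],
    "quality": [5, 6, 5],
})

# isna() gives True/False per cell; sum() counts the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells in the whole table.
print(int(missing_per_column.sum()))
```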

# 3. Statistics

Understanding the statistics of your dataset is a must!


Let's start by using `df.describe()` to show summary statistics for each numeric column of your dataframe.
- count: Number of non-null observations for each column
- mean: Arithmetic mean of each column (average)
- std: Standard deviation of each column
- min: Minimum value of each column
- 25%: First quartile (25%) of each column
- 50%: Median (50%) of each column
- 75%: Third quartile (75%) of each column
- max: Maximum value of each column


In [None]:
df.describe()


- Median: The median is a measure of central tendency that represents the middle value of a dataset. It is calculated by sorting the values in the dataset in ascending or descending order, and then selecting the value that is exactly in the middle. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

- Standard Deviation: The standard deviation is a measure of the spread of a dataset. It measures how much the values in the dataset deviate from the mean. A low standard deviation indicates that the values are clustered around the mean, while a high standard deviation indicates that the values are more spread out. The standard deviation is calculated by taking the square root of the variance.

- Quartiles: The quartiles are values that divide a sorted dataset into quarters. The first quartile (Q1, the 25th percentile) is the value below which roughly 25% of the values fall. The second quartile (Q2) is the same as the median. The third quartile (Q3, the 75th percentile) is the value below which roughly 75% of the values fall. The interquartile range (IQR) is the difference between Q3 and Q1, and it spans the middle 50% of the dataset.
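The three definitions above can be checked by hand on a short list. This is a sketch using Python's built-in `statistics` module on made-up numbers; `values` is hypothetical. It uses the sample standard deviation (dividing by n - 1), which is also the pandas default for `std()`, and linear interpolation between data points for the quartiles.

```python
import statistics

# Made-up measurements, already sorted to make the quartiles easy to follow.
values = [2, 4, 4, 5, 7, 9, 10, 12]

median = statistics.median(values)   # average of the two middle values (5 and 7)
std = statistics.stdev(values)       # sample standard deviation (divides by n - 1)

# method="inclusive" interpolates linearly between neighbouring data points.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(median, q1, q2, q3, iqr)
print(std)
```

Note that `q2` and `median` are the same number, as the definition says.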


If `describe()` seems overwhelming, you can get individual stats (each one has its own method) for the entire dataframe or for selected col(s).

Getting the maximum value of each col in the dataframe:

In [None]:
df.max()

For a single col; in this case we want the max of `pH`:

In [None]:
df['pH'].max()

For multiple cols; in this case we want the max of `pH` and `chlorides`:

In [None]:
df[['pH', 'chlorides']].max()

Here's a list of other useful stats calls. Each works on the whole df or on selected col(s), just like `max()`.
- `min()`
- `mean()`
- `mode()`
    - returns the most frequent value of the df/col
- `median()`
- `quantile(q)`
    - 0 <= q <= 1 (the 25th percentile is `q = 0.25`)
- `std()`

## Correlation (Pearson Correlation)
Correlation measures the strength and direction of the *linear relationship* between two continuous variables.
It is denoted by the symbol "r" and ranges between -1 and 1, where -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive linear correlation.
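To make the range of r concrete, here is a hand-rolled sketch of the Pearson formula on made-up lists. `pearson_r` is a hypothetical helper written for this illustration, not a pandas function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # y rises in step with x
print(pearson_r(x, [10, 8, 6, 4, 2]))   # y falls in step with x

# y = x squared is fully determined by x, but not along a straight line.
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The last case is worth remembering: an r near 0 only rules out a linear relationship, not every relationship.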

You can get the correlation of your df using: `df.corr()`

In [None]:
df.corr()

# Data Visualization Using Plotly

Let's be honest: seeing a bunch of numbers on your screen can be confusing, especially when you are trying to understand their behavior.
This is where data visualization comes into play.
There are a bunch of awesome visualization tools out there, such as Matplotlib, Seaborn, HoloViews, Plotly, and many more.
For this section we will be using Plotly.

Import Plotly's express library:

In [None]:
from plotly import express as px

Bar graphs are great for showing how many occurrences there are of each unique value of a feature.

In [None]:
px.bar(df['quality'].value_counts(), title=r'$\text{You can also flex on people by using LaTex, here a bunch of math stuff: } \int \sigma (x)^{420}dx$').update_xaxes(title='Rating').update_yaxes(title='Count')

Our data (chemical measurements) is composed of continuous distributions. To visualize this kind of data, it is best to use a [histogram](https://datavizcatalogue.com/methods/histogram.html).

In [None]:
px.histogram(df,'pH')

In [None]:
px.histogram(df,'citric acid')

There are all sorts of data visualization methods, and sometimes it isn't obvious which kind of plot to use.
Luckily, [The Data Visualisation Catalogue](https://datavizcatalogue.com/index.html) is a great guide on how to use different types of visualizations.

You can visualize the correlation matrix as a heatmap using `px.imshow`:

In [None]:
px.imshow(df.corr())

# Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.
In its simplest form, linear regression assumes a linear relationship between the variables.
The goal of linear regression is to find the line that best fits the data, so that we can use it to predict values of the dependent variable for new values of the independent variable(s).
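For a single independent variable, "the line that best fits the data" has a closed-form least-squares solution. The sketch below works it out on made-up points; `fit_line` is a hypothetical helper for illustration, and scikit-learn handles the general case with many features.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Predict y for an x the model has not seen.
print(slope * 10 + intercept)
```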

For this section we will be using the scikit-learn library, a popular open-source machine learning library for Python that provides tools for various supervised and unsupervised learning algorithms, as well as data preprocessing and model evaluation techniques.
It is designed to be easy to use, efficient, and accessible to both novice and expert machine learning practitioners.

Here is the doc page for [scikit-learn's LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

First, we import the scikit-learn tools we need: `train_test_split` for splitting the data, `StandardScaler` for data scaling, and `LinearRegression` for the model itself.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

Select the features (X) and the target variable (y).
We drop the 'quality' column from X because it is the target variable we want to predict.

In [None]:
X = df.drop('quality', axis=1)
y = df['quality']


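Use StandardScaler to scale the data. Scaling ensures that each feature has a similar range of values. Plain least-squares linear regression makes the same predictions with or without it, but scaled features give coefficients that can be compared with each other, and many other models do benefit from it.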

The StandardScaler scales data by subtracting the mean from each data point and dividing the result by the standard deviation.
This transformation results in data with a mean of zero and a standard deviation of one.

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
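The transformation StandardScaler applies to each column can be sketched by hand. `standardize` and the `alcohol` values below are hypothetical, for illustration only; the sketch divides by the population standard deviation (dividing by n), which is what StandardScaler uses.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    # Population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

alcohol = [9.0, 10.0, 11.0, 14.0]   # made-up values
scaled = standardize(alcohol)

print(scaled)
print(statistics.fmean(scaled))    # close to 0
print(statistics.pstdev(scaled))   # close to 1
```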

Split the data into training and testing sets, using train_test_split.
This function randomly splits the data into two sets based on the test_size parameter (in this case, 20%).
We set the random_state parameter to ensure that we get the same split each time we run the code.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
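The idea behind a seeded random split can be sketched in a few lines of plain Python. `simple_split` is a hypothetical, simplified stand-in and not how scikit-learn implements `train_test_split`, so it will not pick the same rows; it only shows why a fixed seed makes the split repeatable.

```python
import random

def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))                 # stand-in for 10 wines
train_a, test_a = simple_split(rows)
train_b, test_b = simple_split(rows)   # same seed, so the same split again

print(len(train_a), len(test_a))
print(test_a == test_b)
```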

Fit the linear regression model to the training data using the fit() method of the LinearRegression object.

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

We use the model to make predictions on the test data using the predict() method of the LinearRegression object.

In [None]:
y_pred = regressor.predict(X_test)

Calculate the R-squared score of the model using the score() method of the LinearRegression object.
R-squared measures the goodness of fit of a regression model: the proportion of the variation in the dependent variable that the independent variables explain.
A score of 1 indicates a perfect fit, a score of 0 means the model does no better than always predicting the mean, and on test data the score can even be negative.

In [None]:
score = regressor.score(X_test, y_test)

print('R-squared score:', score)
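To see where the R-squared number comes from, here is a hand-rolled sketch of the usual formula, 1 minus the residual sum of squares divided by the total sum of squares. `r_squared` and the numbers are hypothetical and only meant for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3, 5, 7, 9]   # made-up targets

print(r_squared(y_true, [2.5, 5.5, 6.5, 9.5]))   # predictions close to the truth
print(r_squared(y_true, [6, 6, 6, 6]))           # always predicting the mean
print(r_squared(y_true, [9, 7, 5, 3]))           # worse than predicting the mean
```

The three calls cover the three cases worth recognising: a good fit near 1, the mean-only baseline at 0, and a negative score for a model that does worse than the baseline.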

# Decision Trees

Decision trees are a type of supervised learning algorithm used for classification and regression.
In classification, a decision tree learns a mapping from input features to discrete output classes.
The tree is built by recursively splitting the data into subsets based on the values of the input features, with the goal of maximizing the separation between the output classes.
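The text does not say how "separation" is measured. One common choice, and the default criterion of scikit-learn's `DecisionTreeClassifier`, is Gini impurity. The sketch below is a simplified illustration with hypothetical helpers (`gini`, `split_impurity`) and made-up labels, comparing a clean split with a muddled one.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when every label is the same, larger when mixed."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Average impurity of the two sides, weighted by their sizes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up quality labels before any split: an evenly mixed node.
labels = [5, 5, 5, 6, 6, 6]
print(gini(labels))

# A split that separates the classes perfectly, and one that does not.
print(split_impurity([5, 5, 5], [6, 6, 6]))
print(split_impurity([5, 5, 6], [5, 6, 6]))
```

A tree builder tries many candidate splits and keeps the one with the lowest weighted impurity, then repeats the process on each side.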

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

Make a new X and y, this time let's call them X_tree and y_tree.
Scale X_tree using the `StandardScaler`.
Then split your data into training and testing sets using scikit-learn's `train_test_split()`.


In [None]:
X_tree = df.drop('quality',axis=1)
y_tree = df['quality']
scaler = StandardScaler()
X_tree = scaler.fit_transform(X_tree)
X_train, X_test, y_train, y_test = train_test_split(X_tree, y_tree, test_size=0.2, random_state=42)

Define the `DecisionTreeClassifier` as `clf` and then fit it with your training data

In [None]:
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

After fitting make a prediction and call it `y_tree_pred`

In [None]:
y_tree_pred = clf.predict(X_test)

Get the accuracy of your classifier by using `accuracy_score(y_test, y_tree_pred)`

In [None]:
acc = accuracy_score(y_test, y_tree_pred)
print(f'Accuracy: {acc}')

The confusion matrix is a table used to evaluate the performance of a classification model by comparing the predicted and actual labels.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cm = confusion_matrix(y_true=y_test, y_pred=y_tree_pred)
print(cm)

To make our confusion matrix more readable, let's turn it into a dataframe and add labels.

To get our labels, we'll use `df['quality'].unique()` to get a list of all the unique values and use numpy's `sort()` to sort all the values.

In [None]:
import numpy as np
cm_label = np.sort(df['quality'].unique(), axis=0)
cm_df = pd.DataFrame(cm, columns=cm_label, index=cm_label)

In [None]:
cm_df
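To see how the counts in a confusion matrix are tallied, and why the label order matters when naming rows and columns, here is a hand-rolled sketch on made-up labels. `confusion_counts` is a hypothetical helper; it follows the same convention as scikit-learn, with rows for actual labels and columns for predicted labels.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are actual labels, columns are predicted labels."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[position[actual]][position[predicted]] += 1
    return matrix

# Made-up labels; quality 7 is a possible label that never shows up here.
labels = [5, 6, 7]
y_true = [5, 5, 6, 6, 6]
y_pred = [5, 6, 6, 6, 5]

matrix = confusion_counts(y_true, y_pred, labels)
for row in matrix:
    print(row)

# Correct predictions sit on the diagonal, so accuracy is diagonal / total.
correct = sum(matrix[i][i] for i in range(len(labels)))
print(correct / len(y_true))
```

Passing the label list explicitly keeps a row and column for labels that happen to be absent from the test set. scikit-learn's `confusion_matrix` takes a `labels` argument for the same purpose, which keeps the matrix shape in line with a label list built from the full dataset.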