# Lecture 1: Getting to Know Your Data

Author: Sebastian Torres-Lara

## Overview

## Objective
In this notebook you'll work through coding examples covering some of the fundamentals of data science.

# 1. Getting your Data

To start, we need to get our hands on some data! Luckily, there are plenty of public data repositories out there.

For this case we will be using the red wine quality dataset.

*Side Note*: Download the red wine quality dataset (run the cell below) from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- This dataset contains information about Vinho Verde, or green grape wine from Portugal

In [15]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

--2023-03-23 22:39:30--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84199 (82K) [application/x-httpd-php]
Saving to: ‘winequality-red.csv.1’


2023-03-23 22:39:31 (750 KB/s) - ‘winequality-red.csv.1’ saved [84199/84199]



Check that your wine dataset has been downloaded. It is best to keep it in the same directory as this notebook.


In [16]:
!ls

Lecture_1.ipynb  winequality-red.csv  winequality-red.csv.1
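
The listing above shows both `winequality-red.csv` and `winequality-red.csv.1`: `wget` does not overwrite an existing file by default, so rerunning the download cell saves a numbered copy. A minimal sketch of one way to avoid that from Python, assuming you are happy to use the standard library instead of `wget` (the helper name `download_if_missing` is made up for illustration):

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL copied from the download cell above
DATA_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)


def download_if_missing(url: str, dest: str) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the file was already there.
    """
    path = Path(dest)
    if path.exists():
        return False  # nothing to do, so no duplicate copy is created
    urlretrieve(url, str(path))  # standard-library download helper
    return True
```

Calling `download_if_missing(DATA_URL, "winequality-red.csv")` a second time should simply return `False` instead of creating another copy.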


### Quick Tangent
In case you're wondering, the `!` prefix lets you run terminal commands from a notebook cell.
Knowing terminal commands is not a must, but they'll make you a more efficient coder.
1. `ls` (list) - displays a list of files and directories in the current directory.
    * Example: `ls`

2. `cd` (change directory) - changes the current working directory to the specified directory.
    * Example: `cd /home/user/Documents`

3. `mkdir` (make directory) - creates a new directory with the specified name.
    * Example: `mkdir new_directory`

4. `rm` (remove) - deletes a file or directory.
    * Example: `rm file.txt` or `rm -r directory`

5. `cp` (copy) - copies a file or directory to a new location.
    * Example: `cp file.txt /home/user/Documents`

6. `mv` (move) - moves a file or directory to a new location or renames it.
    * Example: `mv file.txt new_location/file.txt` or `mv file.txt new_name.txt`

7. `touch` - creates a new empty file with the specified name.
    * Example: `touch new_file.txt`

8. `cat` (concatenate) - displays the contents of a file.
    * Example: `cat file.txt`

9. `grep` (global regular expression print) - searches for a specific pattern in a file or files.
    * Example: `grep "hello" file.txt`

10. `sudo` (superuser do) - executes a command with administrative privileges.
    * Example: `sudo apt-get update`
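
If you would rather stay in Python, most of the commands above have rough standard-library counterparts. The sketch below is illustrative only: it works inside a temporary directory, the file names mirror the examples in the list, and the `grep` line is a plain substring match rather than a full regular-expression search.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing else on disk is touched
workdir = Path(tempfile.mkdtemp())

new_dir = workdir / "new_directory"
new_dir.mkdir()                                  # mkdir new_directory

note = workdir / "file.txt"
note.touch()                                     # touch file.txt
note.write_text("hello world\ngoodbye\n")

contents = note.read_text()                      # cat file.txt
# grep "hello" file.txt (substring match only)
matches = [line for line in contents.splitlines() if "hello" in line]

shutil.copy(str(note), str(new_dir / "file.txt"))        # cp file.txt new_directory
shutil.move(str(note), str(workdir / "new_name.txt"))    # mv file.txt new_name.txt

listing = sorted(p.name for p in workdir.iterdir())      # ls

shutil.rmtree(new_dir)                           # rm -r new_directory
(workdir / "new_name.txt").unlink()              # rm new_name.txt
remaining = sorted(p.name for p in workdir.iterdir())

workdir.rmdir()  # clean up the now-empty temporary directory
```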


# 2. Understanding Your Data

Now that we have our dataset, we need to load it into our notebook.
To do this we will use the *pandas* library, a powerful data manipulation and analysis library.
- [Pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)

Import pandas

In [17]:
import pandas as pd

Load the data using `read_csv` and save it as `df`
- This loads the dataset into the notebook as a [DataFrame](https://pandas.pydata.org/docs/reference/frame.html) object
- For this dataset we have to use `delimiter` to specify the character (or sequence of characters) that separates values in the file; here it is a semicolon rather than the default comma

In [18]:
df = pd.read_csv('winequality-red.csv', delimiter=';')
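
To see why the `delimiter` argument matters, here is a small sketch that reads the same semicolon-separated text with and without it. The two data rows are made up for illustration and only mimic the layout of the wine file.

```python
import io

import pandas as pd

# Made-up sample in the same semicolon-separated layout as the wine file
raw = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n"

# The default separator is a comma, so each line ends up in one wide column
wrong = pd.read_csv(io.StringIO(raw))

# Telling pandas about the semicolon splits the values into proper columns
right = pd.read_csv(io.StringIO(raw), delimiter=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

`delimiter` is an alias for the `sep` parameter of `read_csv`, so `sep=';'` behaves the same way.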

Get the shape of your dataframe (number of rows, number of columns) using: `df.shape`
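
A DataFrame's dimensions are available through its `shape` attribute, which returns a `(rows, columns)` tuple. A minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `df`):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the wine data
toy = pd.DataFrame({"pH": [3.51, 3.20, 3.26], "quality": [5, 5, 5]})

# shape is an attribute, not a method, so there are no parentheses
n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # 3 2
```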

View the first 10 rows using: `df.head(10)`

In [19]:
df.head(10)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 5 | 7.4 | 0.66 | 0.0 | 1.8 | 0.075 | 13.0 | 40.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 6 | 7.9 | 0.6 | 0.06 | 1.6 | 0.069 | 15.0 | 59.0 | 0.9964 | 3.3 | 0.46 | 9.4 | 5 |
| 7 | 7.3 | 0.65 | 0.0 | 1.2 | 0.065 | 15.0 | 21.0 | 0.9946 | 3.39 | 0.47 | 10.0 | 7 |
| 8 | 7.8 | 0.58 | 0.02 | 2.0 | 0.073 | 9.0 | 18.0 | 0.9968 | 3.36 | 0.57 | 9.5 | 7 |
| 9 | 7.5 | 0.5 | 0.36 | 6.1 | 0.071 | 17.0 | 102.0 | 0.9978 | 3.35 | 0.8 | 10.5 | 5 |


View the last 10 rows using: `df.tail(10)`

In [20]:
df.tail(10)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1589 | 6.6 | 0.725 | 0.2 | 7.8 | 0.073 | 29.0 | 79.0 | 0.9977 | 3.29 | 0.54 | 9.2 | 5 |
| 1590 | 6.3 | 0.55 | 0.15 | 1.8 | 0.077 | 26.0 | 35.0 | 0.99314 | 3.32 | 0.82 | 11.6 | 6 |
| 1591 | 5.4 | 0.74 | 0.09 | 1.7 | 0.089 | 16.0 | 26.0 | 0.99402 | 3.67 | 0.56 | 11.6 | 6 |
| 1592 | 6.3 | 0.51 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1593 | 6.8 | 0.62 | 0.08 | 1.9 | 0.068 | 28.0 | 38.0 | 0.99651 | 3.42 | 0.82 | 9.5 | 6 |
| 1594 | 6.2 | 0.6 | 0.08 | 2.0 | 0.09 | 32.0 | 44.0 | 0.9949 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.55 | 0.1 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.51 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.31 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |


View 10 random rows using: `df.sample(10)`

In [21]:
df.sample(10)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 382 | 8.3 | 0.26 | 0.42 | 2.0 | 0.08 | 11.0 | 27.0 | 0.9974 | 3.21 | 0.8 | 9.4 | 6 |
| 1010 | 8.9 | 0.28 | 0.45 | 1.7 | 0.067 | 7.0 | 12.0 | 0.99354 | 3.25 | 0.55 | 12.3 | 7 |
| 144 | 5.2 | 0.34 | 0.0 | 1.8 | 0.05 | 27.0 | 63.0 | 0.9916 | 3.68 | 0.79 | 14.0 | 6 |
| 953 | 10.2 | 0.34 | 0.48 | 2.1 | 0.052 | 5.0 | 9.0 | 0.99458 | 3.2 | 0.69 | 12.1 | 7 |
| 1408 | 8.1 | 0.29 | 0.36 | 2.2 | 0.048 | 35.0 | 53.0 | 0.995 | 3.27 | 1.01 | 12.4 | 7 |
| 313 | 8.6 | 0.47 | 0.3 | 3.0 | 0.076 | 30.0 | 135.0 | 0.9976 | 3.3 | 0.53 | 9.4 | 5 |
| 490 | 9.3 | 0.775 | 0.27 | 2.8 | 0.078 | 24.0 | 56.0 | 0.9984 | 3.31 | 0.67 | 10.6 | 6 |
| 761 | 9.3 | 0.655 | 0.26 | 2.0 | 0.096 | 5.0 | 35.0 | 0.99738 | 3.25 | 0.42 | 9.6 | 5 |
| 1009 | 9.6 | 0.5 | 0.36 | 2.8 | 0.116 | 26.0 | 55.0 | 0.99722 | 3.18 | 0.68 | 10.9 | 5 |
| 1260 | 8.6 | 0.635 | 0.68 | 1.8 | 0.403 | 19.0 | 56.0 | 0.99632 | 3.02 | 1.15 | 9.3 | 5 |
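
Because `sample` picks rows at random, rerunning the cell gives a different selection each time. If you want the same rows on every run, you can pass a `random_state` seed. A minimal sketch on a made-up frame (`toy` and the seed value 42 are arbitrary choices for illustration):

```python
import pandas as pd

# Hypothetical 100-row frame standing in for the wine data
toy = pd.DataFrame({"quality": range(100)})

# Using the same seed twice yields the same 10 rows
first = toy.sample(10, random_state=42)
second = toy.sample(10, random_state=42)

print(first.equals(second))  # True
```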


Use `.describe()` to show summary statistics for each numeric column of your dataframe.
- count: Number of non-null observations for each column
- mean: Arithmetic mean of each column
- std: Standard deviation of each column
- min: Minimum value of each column
- 25%: First quartile (25%) of each column
- 50%: Median (50%) of each column
- 75%: Third quartile (75%) of each column
- max: Maximum value of each column


In [22]:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |
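
Each row of the `describe()` output can also be computed directly with a Series method, which can help when you only need one statistic. The sketch below uses a made-up five-value column (`toy` is hypothetical) and compares the manual values with what `describe()` reports.

```python
import pandas as pd

# Hypothetical column standing in for the wine data
toy = pd.DataFrame({"alcohol": [9.4, 9.8, 9.8, 10.0, 10.5]})
col = toy["alcohol"]

manual = {
    "count": col.count(),        # number of non-null values
    "mean": col.mean(),
    "std": col.std(),            # sample standard deviation (ddof=1)
    "min": col.min(),
    "25%": col.quantile(0.25),   # first quartile
    "50%": col.median(),
    "75%": col.quantile(0.75),   # third quartile
    "max": col.max(),
}

summary = toy.describe()["alcohol"]

# Every manual value should agree with the describe() output
agree = all(abs(summary[name] - value) < 1e-12 for name, value in manual.items())
print(agree)
```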


Check whether there are any NaN values (the marker pandas uses for missing data) and count them per column using: `df.isna().sum()`

In [23]:
df.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
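
The wine data has no missing values, so every count above is zero. To see what the same call reports when values are missing, here is a small sketch on a made-up frame with two gaps (`toy` is hypothetical):

```python
import pandas as pd

nan = float("nan")

# Hypothetical frame with one missing pH and one missing quality value
toy = pd.DataFrame(
    {
        "pH": [3.51, nan, 3.26],
        "quality": [5, 5, nan],
        "alcohol": [9.4, 9.8, 9.8],
    }
)

# isna() marks each missing cell as True; sum() counts the Trues per column
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
```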

Check the data type of each column using: `df.dtypes`

In [24]:
df.dtypes

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object
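
In the output above, every column is `float64` except `quality`, which is stored as `int64`. Once you know the dtypes, you can select columns by type or convert them. A minimal sketch on a made-up frame (`toy` is hypothetical, and the `int64` dtype is set explicitly so the example behaves the same on every platform):

```python
import pandas as pd

# Hypothetical two-column frame mirroring the float/int mix in the wine data
toy = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]}).astype(
    {"quality": "int64"}
)

# Pick out only the integer columns
int_cols = toy.select_dtypes(include="int64").columns.tolist()

# Convert a column to another dtype (this returns a new Series)
quality_as_float = toy["quality"].astype("float64")

print(int_cols)                # ['quality']
print(quality_as_float.dtype)  # float64
```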