# Python Beginner Workshop for Data Science
# Part II. Introduction to Data Analysis 🐧

In this part, we will be working with a Penguins dataset to get an overview of data analysis in Python.

The data that we work with is usually stored in files (e.g. in .csv format), and if you are planning to dive deeper into data analysis, it is very important to know how to handle files appropriately. You can check out this tutorial to get an overview of basic approaches and commands: https://python.land/operating-system/python-files.

For simplicity, we will not touch on that part now. Instead, we will use the `Seaborn` library, which is mostly used for statistical plotting (visualization) in Python but also provides some datasets that can be used for basic exploratory data analysis.

Let's first import `Seaborn`:

In [1]:
import seaborn as sns

In [None]:
# Checking which datasets are offered by Seaborn
print(sns.get_dataset_names())

You can check out each of these datasets here: https://github.com/mwaskom/seaborn-data.

### **Understanding the Dataset and Importing Libraries**

Let's load our dataset using Seaborn's built-in `load_dataset()` function.

In [3]:
df = sns.load_dataset("penguins")  # df stands for data frame (you can also use a different variable name)

A **DataFrame** is a two-dimensional data structure that consists of three components: data, rows and columns.
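To make the three components concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The variable name `toy` and the values in it are made up for illustration and are not taken from the penguins dataset.

```python
import pandas as pd

# A tiny hand-made table (values are invented for illustration only)
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "body_mass_g": [3750, 5000, 3700],
    }
)

print(toy.index)    # row labels (0, 1, 2 by default)
print(toy.columns)  # column labels
print(toy.values)   # the data itself, as a two-dimensional array
print(toy.shape)    # (number of rows, number of columns)
```

The penguins DataFrame `df` loaded above has the same three parts, only with many more rows and columns.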

Let's see how it looks:

In [None]:
df

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

**Penguins Dataset Columns**

`species`: a factor denoting penguin type (Adélie, Chinstrap and Gentoo)

`island`: a factor denoting island in Palmer Archipelago, Antarctica (Biscoe, Dream or Torgersen)

`bill_length_mm`: a number denoting bill length (millimeters)

`bill_depth_mm`: a number denoting bill depth (millimeters)

`flipper_length_mm`: an integer denoting flipper length (millimeters)

`body_mass_g`: an integer denoting body mass (grams)

`sex`: a factor denoting penguin sex (female, male)




Let's now import other libraries that we will use for data analysis.


`NumPy` (short for Numerical Python): a Python library used for scientific computing that allows us to work with multi-dimensional arrays and matrices.

`pandas`: a library used for working with datasets. It contains functions for analyzing, cleaning, exploring, and manipulating data.

`Matplotlib`: a library for creating static, animated, and interactive visualizations in Python.

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We can use the pandas DataFrame `describe()` method to view some basic statistical details about the numeric columns of the dataset.

In [None]:
df.describe()
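As a rough sketch of what `describe()` reports, the example below applies it to a small made-up Series (the name `masses` and its values are assumptions for illustration) and compares a few entries with the corresponding individual methods.

```python
import pandas as pd

# Invented body masses in grams, for illustration only
masses = pd.Series([3000.0, 3500.0, 4000.0, 4500.0, 5000.0])

summary = masses.describe()
print(summary)  # count, mean, std, min, quartiles (25%, 50%, 75%), max

# The same numbers are available one at a time
print(masses.mean())         # matches summary["mean"]
print(masses.std())          # matches summary["std"]
print(masses.quantile(0.5))  # the median, matches summary["50%"]
```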

We can also get information about the DataFrame, such as column types and non-null counts, with the pandas DataFrame `info()` method.

In [None]:
df.info()

### **Data Cleaning**

Real-world data is often noisy: it might contain missing values, duplicates, or outliers, or be incorrectly formatted or mislabeled.

**Data Cleaning** is an essential step in data analysis that ensures the accuracy, consistency, and completeness of the data used for analysis. 

For example, we might want to first check whether our dataset has any missing data. Missing data is commonly recorded as `NaN` (Not a Number), a special value available through the **NumPy** library.
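`NaN` behaves a little differently from ordinary numbers, which is why dedicated functions are needed to detect it. The following is a minimal sketch with made-up values; the name `masses` is only illustrative.

```python
import numpy as np
import pandas as pd

# Invented measurements with one missing entry
masses = pd.Series([3750.0, np.nan, 3250.0])

print(type(np.nan))      # NaN is a floating-point value
print(np.nan == np.nan)  # False: NaN is not equal to anything, not even itself
print(np.isnan(np.nan))  # True: use a dedicated check instead of ==
print(masses.mean())     # pandas skips NaN by default when aggregating
```

Because `==` cannot find `NaN`, pandas offers its own methods for locating missing values.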

Let's use the pandas DataFrame `isnull()` method to detect missing values in our dataset and `sum()` to see how many missing values there are in each column:

In [None]:
df.isnull().sum()

Only a few of the 344 rows contain missing values, so we can drop them from our analysis.

We will do that using the pandas DataFrame `dropna()` method.

In [12]:
df.dropna(inplace=True)

Let's check the documentation for this method to understand why we use the `inplace=True` argument:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
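The difference that `inplace=True` makes can be sketched on a small made-up DataFrame (the names `toy`, `cleaned` and `result` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Invented values with one missing entry
toy = pd.DataFrame({"bill_length_mm": [39.1, np.nan, 40.3]})

# Without inplace, dropna() returns a new DataFrame and leaves toy untouched
cleaned = toy.dropna()
print(len(toy), len(cleaned))

# With inplace=True, toy itself is modified and the method returns None
result = toy.dropna(inplace=True)
print(result, len(toy))
```

An alternative with the same effect is to reassign the result, as in `toy = toy.dropna()`.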

In [None]:
df.isnull().sum()

We can see that our dataframe now does not contain any missing values.

In [None]:
len(df)

Checking for missing values is, of course, not the only important aspect of data cleaning and preparation.

If you would like to learn a bit more, then check out this short Beginner's Guide to Data Cleaning by DataCamp: https://www.datacamp.com/tutorial/guide-to-data-cleaning-in-python.
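As one more illustration of the issues listed at the start of this section, duplicated rows can be detected and removed in a similar way. This is a minimal sketch on made-up data; the names `toy` and `deduplicated` are assumptions and the rows are not taken from the penguins dataset.

```python
import pandas as pd

# Invented rows; the second row repeats the first one exactly
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo"],
        "island": ["Torgersen", "Torgersen", "Biscoe"],
    }
)

# duplicated() marks every row that repeats an earlier row
print(toy.duplicated().sum())

# drop_duplicates() returns a new DataFrame without the repeated rows
deduplicated = toy.drop_duplicates()
print(len(deduplicated))
```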

### **Exploratory Data Analysis**

#### **Pie Chart**

Let's say we would first like to see what percentage of the penguins in the dataset belongs to each species.

A **pie chart** is well suited to illustrating numerical proportions.

We use the pandas `value_counts()` method on the `species` column:

In [None]:
species_counts = df["species"].value_counts()
species_counts

We can use **pandas** to create some simple plots:

In [None]:
species_counts.plot(kind="pie", colors=["c", "m", "lightblue"], ylabel="", autopct="%.2f%%");

Find out which other parameters can be specified for `pandas.DataFrame.plot` in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html.
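The percentages that `autopct` prints on the chart can also be computed directly. The sketch below uses a small made-up Series (the names `labels`, `counts` and `shares` are assumptions for illustration) and the `normalize=True` option of `value_counts()`.

```python
import pandas as pd

# Invented labels, for illustration only
labels = pd.Series(["Adelie", "Adelie", "Gentoo", "Chinstrap"])

counts = labels.value_counts()                      # absolute counts, largest first
shares = labels.value_counts(normalize=True) * 100  # the same counts as percentages

print(counts)
print(shares)
```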

**Exercise**

Create a pie chart that shows the percentage of male and female penguins in the dataset.

In [17]:
# Your solution goes here



#### **Histogram**

Let's say we now want to see how the value of some variable is distributed in the dataset.

A **histogram** is a chart that represents the distribution of data as a series of bars, where each bar shows how many values fall into a given range (bin).

This time we'll use **Matplotlib** for plotting.

In [None]:
flipper_length = df.flipper_length_mm
flipper_length

In [None]:
plt.hist(flipper_length, bins=15, color="lightblue", edgecolor='#169acf')  # 'bins' argument controls the number of bars in the histogram

plt.title('Flipper Length')  # Optionally setting a title
plt.xlabel('Flipper length (mm)')  # Setting the x-axis label
plt.ylabel('Frequency')  # Optionally setting the y-axis label
plt.grid(axis='y', alpha=0.3)

plt.show() # Displaying the plot

Refer here for `matplotlib.pyplot.hist` documentation: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html.
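To get a feel for what the `bins` argument does, the counting behind a histogram can be reproduced with NumPy's `np.histogram`. This is only a sketch on made-up flipper lengths; the names `lengths`, `counts` and `edges` are assumptions for illustration.

```python
import numpy as np

# Invented flipper lengths in millimeters
lengths = np.array([181, 186, 195, 193, 190, 210, 215, 222])

# Split the range from the smallest to the largest value into 3 equal-width bins
# and count how many values fall into each bin
counts, edges = np.histogram(lengths, bins=3)

print(edges)   # 4 edges delimit 3 bins
print(counts)  # number of values per bin; these would be the bar heights
```

A larger number of bins gives narrower bars and a more detailed, but also noisier, picture of the distribution.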

**Exercise**

Create a histogram that shows the distribution of body mass.

In [20]:
# Your solution goes here



#### **Scatter Plot**

Sometimes we might want to see the relationship between two variables by looking at each data point more closely. For this, a scatter plot is a useful visualization tool.

A **scatter plot** is a type of data display that shows the relationship between two numerical variables.

We will create a scatter plot with the help of the **Seaborn** library. For example, let's look at the scatter plot of bill lengths and depths by penguin species.

In [None]:
# We can improve the resolution of the plot by setting DPI (Dots Per Inch)
# plt.figure(dpi=300)

sns.scatterplot(x="bill_length_mm", y="bill_depth_mm", data=df, hue="species")
plt.title("Bill Length vs Bill Depth", size=20, color="red");

Refer to `seaborn.scatterplot` to look closer at the parameters: https://seaborn.pydata.org/generated/seaborn.scatterplot.html.
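A relationship that is visible in a scatter plot can also be summarized with a single number, for example the Pearson correlation coefficient. The sketch below uses made-up measurements (the name `toy` and all values are assumptions, not real penguin data) and the pandas `Series.corr()` method.

```python
import pandas as pd

# Invented bill measurements in millimeters, for illustration only
toy = pd.DataFrame(
    {
        "bill_length_mm": [35.0, 40.0, 45.0, 50.0, 55.0],
        "bill_depth_mm": [15.0, 16.0, 18.0, 19.0, 21.0],
    }
)

# Pearson correlation: close to 1 means the points lie near a rising straight line,
# close to -1 a falling line, and close to 0 no linear relationship
r = toy["bill_length_mm"].corr(toy["bill_depth_mm"])
print(round(r, 3))
```

A single coefficient computed over the whole dataset can hide differences between groups, which is one reason to color the points by species with `hue`.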

**Exercise**

Create a scatter plot to see the relationship between penguins' body mass and flipper length. Group them by species.

In [22]:
# Your solution goes here

