## Python for Data Analysis and Visualization

This workshop was developed by Ramandeep Makhija with the assistance of ChatGPT (2023).

Python is a high-level programming language known for its simplicity and readability. Its straightforward, beginner-friendly syntax makes code easier to learn and write. Python is widely used in many domains, including web development, data analysis, artificial intelligence, and automation. It offers a rich set of libraries and frameworks that simplify complex tasks and enable developers to build powerful applications efficiently. In summary, Python is widely adopted for its simplicity, versatility, and extensive ecosystem of libraries and tools.

The best way to learn a programming language is to do something useful with it, so this hands-on introductory workshop will cover:

#### Working with Data in Python
* Introduction to data structures (lists, dictionaries, tuples)
* Accessing and manipulating data in data structures

#### Data Analysis and Visualization
* Introduction to Pandas for data analysis
* Loading and exploring data with Pandas
* Data Cleaning and preprocessing
* Data Visualization with matplotlib and Seaborn

### Data Structures

Data structures are containers used to organize and store data in a computer program. They provide a way to efficiently access, manipulate, and represent data. 

Now we will go through some commonly used data structures in Python, with examples:

#### Lists:

Lists are ordered collections that can store elements of different data types. They are denoted by square brackets and can be modified (mutable).

##### Examples

In [None]:
# Empty list
a = []

# List with three different data types
b = [1, 2, 3, 'hello', True]

# You can also create a list using a list() constructor 
c = list((1, 2, 3))

In [None]:
# Use print statements to look at the values of the lists you created 

print(a)
print(b)
print(c)

In [None]:
# We can access elements within a list using their indices
# In Python, indices start from 0 instead of 1

# To access first element of list b
print(b[0])

# Third element of list c is
print('The third element of list c is: ', c[2])

In [None]:
# We can also slice and access multiple elements from the list

# Create a new list d from 3rd to last element from list b
d = b[2:]
print(d)
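
Slicing supports a few more forms than the one used above. The following is a minimal sketch using a hypothetical list named `letters` (not part of the workshop data) to show start, stop, negative, and step indices:

```python
# Hypothetical list used only for illustration
letters = ['a', 'b', 'c', 'd', 'e']

# The start defaults to 0, and the stop index (2) is excluded
first_two = letters[:2]

# Elements at indices 1, 2, and 3
middle = letters[1:4]

# Negative indices count from the end of the list
last = letters[-1]

# A third value sets the step between elements
every_other = letters[::2]

print(first_two)
print(middle)
print(last)
print(every_other)
```

A slice always returns a new list, so the original list is left unchanged.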

In [None]:
## We can also add or remove elements from a list

# Let's print the value of list a
print('List a before adding an element: ', a)

# Let's add the integer 10 to the empty list a
a.append(10)
print('List a after appending: ', a)

# Remove the added integer 10 from list a
a.remove(10)
print('List a after removing: ', a)

##### Exercise
* Create a list called fruits with the following fruits: "apple", "banana", "orange", "grape", "kiwi". Print the third item in the list.
* Add "mango" to the fruits list. Print the updated list.
* Remove "orange" from the fruits list. Print the updated list.


In [None]:
## Write your exercise code below



#### Dictionaries:

Dictionaries store data as key-value pairs. Each value is associated with a unique key, allowing fast access to data. They are denoted by curly braces and can be modified (mutable).

##### Examples

In [None]:
# Empty dictionary using curly brackets
e = {}

# Empty dictionary using dict() constructor
f = dict()

# A two element dictionary with name as key and age as value using curly brackets
g = {'Sam' : 28, 'Abby' : 21}

# Two element dictionary with name as key and age as value using dict() constructor
h = dict([['Sam', 28], ['Abby', 21]])

In [None]:
# Let's print the values of the dictionaries above
print('Dictionary e: ', e)
print('Dictionary f: ', f)
print('Dictionary g: ', g)
print('Dictionary h: ', h)

In [None]:
## We can also create a dictionary of dictionaries

# Let's create a dictionary with the age and department of two people
i = {'Sam': {'Age': 28, 'Department': 'Physics'}, 'Abby': {'Age': 21, 'Department': 'Computer Science'}}
print('Dictionary i: ', i)

In [None]:
# Now let's access the elements of dictionary i
print("Abby's information from dictionary i: ", i['Abby'])

# To access just Sam's department from dict i
print("Sam's department from dict i: ", i['Sam']['Department'])

In [None]:
## Dictionaries are mutable

# Add another person to dictionary i
i['Peter'] = {'Age': 22, 'Department': 'Physics'}

# Print the dictionary i after adding a new person
print(i)
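
Besides adding new keys, a dictionary can have existing values overwritten, missing keys looked up safely, and entries removed. The sketch below uses a hypothetical `stock` dictionary (not part of the workshop data) to illustrate these operations:

```python
# Hypothetical inventory dictionary, used only for illustration
stock = {'pens': 12, 'notebooks': 5}

# Assigning to an existing key overwrites its value
stock['pens'] = 10

# get() returns a default value instead of raising KeyError for a missing key
erasers = stock.get('erasers', 0)

# del removes a key-value pair
del stock['notebooks']

print(stock)
print('Erasers in stock:', erasers)
```

Using `get()` is one way to avoid a `KeyError` when a key may not exist; indexing with square brackets is still the usual choice when the key is known to be present.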

##### Exercise
* Create a dictionary called student with the following key-value pairs: "name" - "John", "age" - 20, "city" - "New York". Print the value associated with the "age" key.
* Add a new key-value pair to the student dictionary: "grade" - "A". Print the updated dictionary.
* Change the value of the "age" key to 21. Print the updated dictionary.

In [None]:
## Write your exercise code below



#### Tuples:

Tuples are similar to lists but are immutable (cannot be modified once created). They are denoted by parentheses and are typically used to store related data.

##### Examples

In [None]:
# Create a tuple using round brackets
j = (1, 2, 3, 4)

# Create a tuple using tuple() constructor
k = tuple([1, 2, 3, 4, 5])

# We can also create a tuple from a list using a tuple constructor
l = tuple(b)

In [None]:
# Let's print the values of the tuples we created
print('tuple j: ', j)
print('tuple k: ', k)
print('tuple l: ', l)

In [None]:
# We cannot change or update a value in a tuple; the line below raises a TypeError
j[0] = 10

In [None]:
# however we can access the values of a tuple like we did for lists using indices
print('Fourth element from tuple l: ', l[3])
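
Tuples can also be unpacked into separate variables, which is a common reason to use them for related data. Here is a minimal sketch with a hypothetical `point` tuple (an illustrative name, not from the workshop data) that also shows how the immutability error can be caught:

```python
# Hypothetical tuple holding a 2D point, used only for illustration
point = (3, 7)

# Unpacking assigns each element to a variable, left to right
x, y = point
print('x =', x, 'y =', y)

# Tuples are immutable, so item assignment raises a TypeError
try:
    point[0] = 10
except TypeError as err:
    print('Cannot modify a tuple:', err)
```

The number of variables on the left must match the number of elements in the tuple; otherwise Python raises a `ValueError`.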

##### Exercise
* Create a tuple called colors with the following colors: "red", "green", "blue". Print the second item in the tuple.
* Attempt to modify an item in the colors tuple (e.g., colors[0] = "yellow"). Observe and explain the error.
* Unpack the colors tuple into three variables: color1, color2, and color3. Print the values of the variables.

In [None]:
## Write your exercise code below




To learn more about data structures, you can refer to [this](https://docs.python.org/3/tutorial/datastructures.html 'Data Structure') documentation. It provides detailed explanations and insights into different types of data structures and how they can be used in Python.

### Pandas

Pandas is a popular Python library for data manipulation and analysis. It offers powerful data structures and tools, making it widely used in various domains. It simplifies data tasks, integrates well with other libraries, and has a large and active community. 

Key features of Pandas:
* Data Structure - Pandas provides two primary data structures: Series and DataFrame.
* Data Manipulation - Pandas offers a wide range of functions and methods for data manipulation, such as filtering, sorting, merging, reshaping, and aggregating data.
* Data Analysis - Pandas provides tools for summarizing and exploring data, such as descriptive statistics, value counts, and group-wise aggregation.
* Integration with other libraries - Pandas seamlessly integrates with other Python libraries, such as NumPy, Matplotlib, and scikit-learn. 
* Efficiency - Pandas is optimized for performance and efficiency. It handles large datasets effectively and provides fast data processing capabilities, making it suitable for working with real-world data.
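
As a rough illustration of the features listed above, the sketch below builds a Series and a DataFrame and then filters, sorts, and aggregates the data. It assumes Pandas is already installed, and the names and values are made up for illustration:

```python
import pandas as pd

# A Series is a one-dimensional labeled array
ages = pd.Series([28, 21, 22], index=['Sam', 'Abby', 'Peter'], name='Age')

# A DataFrame is a two-dimensional table; each column is a Series
people = pd.DataFrame(
    {
        'Age': [28, 21, 22],
        'Department': ['Physics', 'Computer Science', 'Physics'],
    },
    index=['Sam', 'Abby', 'Peter'],
)

print(ages)
print(people)

# Filtering: keep only the rows where Department is 'Physics'
physicists = people[people['Department'] == 'Physics']

# Sorting: order the rows by age, youngest first
by_age = people.sort_values('Age')

# Aggregating: mean age within each department
mean_age_by_dept = people.groupby('Department')['Age'].mean()
print(mean_age_by_dept)
```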

To learn more about Pandas, you can refer to the official Pandas [documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide), which provides detailed explanations, examples, and tutorials.

Additionally, there are several online tutorials, courses, and books available that offer comprehensive guidance and practical examples for learning Pandas effectively.

#### Getting started with Pandas

To use Pandas, you need to install it first. You can install Pandas using the Python package manager pip by running the command: pip install pandas.

Once installed, you can import the Pandas library in your Python script or Jupyter Notebook using the following convention:

In [None]:
import pandas as pd

#### Importing other necessary libraries

Import other necessary libraries such as NumPy, Matplotlib, and Seaborn. Here is a brief explanation of all three libraries, with links to their official documentation.

* NumPy:
NumPy is a Python library for numerical computing. It provides powerful tools for working with arrays and performing mathematical operations efficiently. NumPy's official documentation offers detailed explanations, examples, and tutorials on how to use the library effectively.

    Official documentation: [NumPy Documentation](https://numpy.org/doc/)

* Matplotlib:
Matplotlib is a plotting library for creating static, animated, and interactive visualizations in Python. It offers a wide range of plotting options and customization capabilities. The official Matplotlib documentation provides comprehensive guidance on using the library and showcases various plot types and customization techniques.

    Official documentation: [Matplotlib Documentation](https://matplotlib.org/stable/users/index.html)

* Seaborn:
Seaborn is a Python data visualization library built on top of Matplotlib. It provides a high-level interface for creating attractive statistical graphics. The official Seaborn documentation offers detailed explanations of the library's features, tutorials on using different plot types, and guidance on customization options.

    Official documentation: [Seaborn Documentation](https://seaborn.pydata.org/)

These official documentation links provide in-depth resources to learn more about the libraries, their functionalities, and how to use them effectively.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### Downloading and loading the dataset as a DataFrame

Let's start with using the Netflix dataset for exploratory data analysis (EDA) and visualizations. Here's a step-by-step guide to get you started:

Download the Netflix dataset:

Visit the Netflix Movies and TV Shows dataset page on Kaggle: https://www.kaggle.com/shivamb/netflix-shows
Click on the "Download" button to download the dataset file (e.g., netflix_titles.csv).

In [None]:
## Loading the downloaded dataset as a DataFrame
df = pd.read_csv('netflix_titles.csv')

#### Exploratory data analysis (EDA)

About this dataset: Netflix is one of the most popular media and video streaming platforms. As of mid-2021, it had over 8,000 movies and TV shows available on its platform and over 200 million subscribers globally. This tabular dataset consists of listings of all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.

Let's explore the dataset.

In [None]:
## First 5 rows of the data
df.head()

In [None]:
## If you want to look at first 10 rows or more
df.head(10)

In [None]:
## Now let's check the dimensions of the dataset - number of rows and columns
df.shape

In [None]:
## Now let's look at the column headers/names
df.columns

In [None]:
## Overview of the columns
df.info()

An "object column" in Pandas refers to a column in a DataFrame that contains data of the Python object data type. It is a catch-all data type that can store various types of data, including strings, mixed data types, or custom objects. Object columns are flexible but may require conversion to more appropriate data types for efficient data manipulation and analysis.
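
To make the conversion step concrete, here is a small sketch on a hypothetical miniature table (the values are made up and are not taken from the Netflix file). It converts a column of date strings to a datetime type and a column of repeated labels to a categorical type. Depending on the Pandas version, the text columns may initially be reported as `object` or as a dedicated string type.

```python
import pandas as pd

# Hypothetical miniature table; the values are made up for illustration
sample = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie'],
    'date_added': ['2021-09-25', '2021-09-24', '2020-01-01'],
})

# Data types before conversion
print(sample.dtypes)

# Convert date strings to a datetime type so date operations become available
sample['date_added'] = pd.to_datetime(sample['date_added'])

# Convert a column with few distinct labels to the category type
sample['type'] = sample['type'].astype('category')

# Data types after conversion
print(sample.dtypes)

# Datetime columns expose components such as the year through the .dt accessor
print(sample['date_added'].dt.year)
```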

In [None]:
## Let's look at missing values in the dataset
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull())

In [None]:
# Use the reversed "rocket" color palette for clearer contrast
cmap = "rocket_r"

# Increase the size of the heatmap
plt.figure(figsize=(10, 6))

# Draw the heatmap without a color bar
sns.heatmap(df.isnull(), cmap=cmap, cbar=False)

# Add axis labels and a title after drawing, because sns.heatmap() resets the axis labels
plt.xlabel("Columns")
plt.ylabel("Rows")
plt.title("Missing Values Heatmap")

# Display the plot
plt.show()


##### Handling missing values

Drop rows with missing values

`df_cleaned = df.dropna()`

Fill missing values with specific values

`df_filled = df.fillna('N/A')`

Replace missing values with mean, median or mode

`df['column_name'] = df['column_name'].fillna(df['column_name'].mean())`

Remember to consider the nature of the data and the impact of handling missing values in a specific way. It's important to use domain knowledge and make informed decisions based on the context of your analysis.

You can apply these techniques to specific columns or the entire dataset, depending on the specific requirements of your analysis. Additionally, Pandas provides various methods and parameters to handle missing values, such as fillna(), dropna(), and interpolate().

Before applying any data cleaning operations, it's essential to understand the nature of missing values, their potential causes, and the implications of different approaches to handling them.
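
The techniques above can be tried on a small table before applying them to the full dataset. The following sketch uses a hypothetical `toy` DataFrame with made-up column names and values; it is not the Netflix data:

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing entries; values are made up for illustration
toy = pd.DataFrame({
    'director': ['A. Lee', None, 'B. Cruz', None],
    'duration_min': [90.0, np.nan, 110.0, 100.0],
})

# Drop only the rows where 'duration_min' is missing
dropped = toy.dropna(subset=['duration_min'])

# Work on a copy so the original table is left untouched
filled = toy.copy()

# Fill a text column with a placeholder value
filled['director'] = filled['director'].fillna('Unknown')

# Fill a numeric column with its mean
filled['duration_min'] = filled['duration_min'].fillna(filled['duration_min'].mean())

print(dropped)
print(filled)
```

Assigning the result back to the column, rather than calling `fillna()` with `inplace=True` on a selected column, keeps the behavior consistent across Pandas versions.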


In [None]:
## Now let's check if we have any duplicates
df.duplicated().sum()
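
If the count above were greater than zero, the repeated rows could be removed with `drop_duplicates()`. Here is a minimal sketch on a hypothetical `titles` table with made-up values, since the outcome for the Netflix file depends on the downloaded version:

```python
import pandas as pd

# Hypothetical table containing one repeated row, for illustration only
titles = pd.DataFrame({
    'title': ['Show A', 'Show B', 'Show A'],
    'release_year': [2019, 2020, 2019],
})

# duplicated() marks every repeat of an earlier row as True
print(titles.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default
deduped = titles.drop_duplicates().reset_index(drop=True)
print(deduped)
```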

In [None]:
## Checking for unique values for 'type' column
df['type'].unique()

In [None]:
## Checking value counts for 'type' column
df['type'].value_counts()

In [None]:
## We can also plot a countplot or a pie chart to do the same

# Countplot
sns.countplot(x='type', data=df)
plt.show()

In [None]:
# Set a visually appealing color palette
sns.set_palette("pastel")

# Set a white background
sns.set_style("whitegrid")

# Create the countplot
ax = sns.countplot(x='type', data=df)

# Add labels and title
plt.xlabel("Type")
plt.ylabel("Count")
plt.title("Distribution of Types")

# Customize the plot aesthetics
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Display the plot
plt.show()

In [None]:
# Calculate the count of each unique value in the 'type' column
type_counts = df['type'].value_counts()

# Create a white background for the plot
plt.figure(facecolor='white')

# Create a pie chart
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', startangle=90)

# Add a title to the chart
plt.title('Distribution of Types')

# Set the aspect ratio to be equal to display a circular pie chart
plt.axis('equal')

# Display the chart
plt.show()

In [None]:
## Now let's look at another column, 'release_year'
df['release_year'].unique()

In [None]:
## Let's create a histogram to visualize the distribution of the 'release_year' column
sns.histplot(df['release_year'], bins=30, kde=True)

In [None]:
# Set a visually appealing color palette
sns.set_palette("deep")

# Set a dark background with grid lines
sns.set_style("darkgrid")

# Create the histogram plot
sns.histplot(df['release_year'], bins=30, kde=True)

# Add labels and title
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.title('Distribution of Release Years')

# Customize the plot aesthetics
plt.xticks(rotation=45)

# Display the plot
plt.show()

In [None]:
## Let's look at the earliest and the latest release years
earliest_year = df['release_year'].min()
latest_year = df['release_year'].max()
print('Earliest Release Year:', earliest_year)
print('Latest Release Year:', latest_year)

##### Exercise
Using the above dataset, create a bar plot to visualize the distribution of content by country for the top 10 countries with the most content.

Instructions:

* Calculate the count of content by country and select the top 10 countries
* Create a barplot using sns.barplot()
* Label both the axes
* Add a title
* Interpret the bar plot and identify the top 3 countries with the highest number of content.

In [None]:
## Write your exercise code below


