<a href="https://colab.research.google.com/github/CommunityRADvocate/ida3-fall24-colabs/blob/main/Week_11_Data_Viz.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GCode Intro to Data Analytics
## Week 11 - Matplotlib, Data visualization, and Pandas

Welcome to the Python activities for this week! These activities have been designed to test and enhance your programming skills. In each challenge, you'll work on specific tasks, developing appropriate code structures.

The objective of this exercise is to explore the new syntax and functions related to the matplotlib library that you learned throughout this week's track and to practice these concepts in real-world scenarios.

# Importing the Libraries



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Activity 1. Analyzing Climate Data - Temperature and Precipitation Patterns

Welcome to a hands-on journey into climate data analysis! In this activity, we'll use Matplotlib to investigate temperature and precipitation patterns. You'll learn how to visualize and interpret climate data, gaining valuable insights into our planet's climate. Let's dive in and discover what the data reveals about temperature and precipitation trends.

<br>

First, write code that loads the data from the CSV file into a pandas DataFrame.

Remember that each time you exit Colab you must upload the CSV file again.



```
df = pd.read_csv("File_name")
```



In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/nicoleenos/IDA-Pilot-23/main/Activities%20Week%2011/Prec_data.csv')

df.head()

Unnamed: 0,Date,Avg_Temperature_C,Total_Precipitation_mm
0,2022-01-01,3.5,10.2
1,2022-01-02,4.2,8.7
2,2022-01-03,5.1,5.5
3,2022-01-04,6.2,3.0
4,2022-01-05,7.3,2.8


To get a better insight into how data visualization tools operate, we will work with the most common chart types that belong to the Matplotlib library:

## Task A: Line Graph

Make a line graph that shows how the average temperature and total precipitation values change over time. Use only the data for the month of January.

[Documentation for Line Graph](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html)

<br>

Feel free to use the structure below to organize your code, but it is not required. You can delete it and approach the task however you feel most comfortable.

In [None]:
jan = df.iloc[0:31]
jan

Unnamed: 0,Date,Avg_Temperature_C,Total_Precipitation_mm
0,2022-01-01,3.5,10.2
1,2022-01-02,4.2,8.7
2,2022-01-03,5.1,5.5
3,2022-01-04,6.2,3.0
4,2022-01-05,7.3,2.8
5,2022-01-06,8.2,1.9
6,2022-01-07,8.9,0.7
7,2022-01-08,9.7,0.0
8,2022-01-09,9.8,0.0
9,2022-01-10,10.5,0.0


In [None]:
# Extract data for the month of January
# Convert the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Filter data to only include records from January
df_january = df[df['Date'].dt.month == 1]


# Extract the date, average temperature, and total precipitation columns

# Create a figure and set its size

# Create a line plot for average temperature

# Create a line plot for total precipitation

# Add a title and labels to the axes

# Add a legend to the chart

# Rotate the x-axis labels for better readability (optional)
# plt.xticks(rotation=90)

# Show the plot

Be sure to style the graph. Don't forget to:
* Give the figure a title and label the axes
* Change the color of the lines
* Choose a line style for each line
* Add a legend to the chart
* Change the size of the figure, and so on
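
As an illustration of the styling options listed above, here is a minimal sketch on made-up data. The DataFrame `toy` and its column names (`day`, `humidity_pct`, `wind_kmh`) are hypothetical and are not part of the climate dataset; the colors, line styles, and figure size are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up toy data (hypothetical columns, not from Prec_data.csv)
toy = pd.DataFrame({
    "day": pd.date_range("2023-03-01", periods=7, freq="D"),
    "humidity_pct": [61, 64, 70, 68, 59, 55, 57],
    "wind_kmh": [12, 15, 9, 20, 18, 11, 14],
})

# Create a figure and set its size
fig, ax = plt.subplots(figsize=(10, 5))

# One line per variable, each with its own color, line style, and marker
ax.plot(toy["day"], toy["humidity_pct"],
        color="tab:green", linestyle="-", marker="o", label="Humidity (%)")
ax.plot(toy["day"], toy["wind_kmh"],
        color="tab:purple", linestyle="--", marker="s", label="Wind speed (km/h)")

# Title, axis labels, and legend
ax.set_title("Toy example: humidity and wind speed over one week")
ax.set_xlabel("Date")
ax.set_ylabel("Value")
ax.legend()

# Rotate the x-axis labels so the dates do not overlap
plt.xticks(rotation=45)

plt.show()
```

The same pattern should carry over to the January data once you swap in the real columns.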

## Task B: Scatter Plot

Make a scatter plot of the **average temperature** against the **total precipitation** for the whole dataset to check whether there is any relationship between these two variables. Again, do not forget to customize the graph with the characteristics mentioned above.

[Documentation for Scatter Plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)
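
For reference, a customized scatter plot might look like the sketch below. It uses randomly generated numbers rather than the climate data, and the variable names `x` and `y`, the seed, and the styling values are all assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: y depends roughly linearly on x, plus random noise
rng = np.random.default_rng(seed=42)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 3, 50)

fig, ax = plt.subplots(figsize=(7, 5))

# Each observation becomes one dot; alpha makes overlapping dots visible
ax.scatter(x, y, color="darkorange", edgecolors="black", alpha=0.7,
           label="Observations")

ax.set_title("Toy example: y versus x")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# A light dotted grid makes it easier to read values off the axes
ax.grid(True, linestyle=":", alpha=0.5)

plt.show()
```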

In [None]:
# Extract average temperature and total precipitation

# Create a scatter plot

# Add a title and labels to the axes

# Add a legend

# Customize the grid

# Show the plot

Question: Can you spot a correlation between the two variables by eye?

<br>

**Bonus:** Try to calculate the correlation coefficient between the two variables, and draw a conclusion from the result.

Remember that the correlation coefficient is a number between -1 and 1 that indicates how strongly two variables are linearly related.

  * If it is 1, there is a perfect positive linear relationship
  * If it is -1, there is a perfect negative linear relationship
  * If it is 0, there is no linear relationship at all

```
corr_coef = df['Variable_1'].corr(df['Variable_2'])
```
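
To see how the coefficient behaves, the following sketch applies `Series.corr` (which computes the Pearson correlation by default) to a small made-up table. The DataFrame `toy` and its columns are hypothetical and unrelated to the climate data.

```python
import pandas as pd

# Made-up toy data (hypothetical columns)
toy = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],   # rises with hours_studied
    "hours_of_tv": [5, 4, 3, 2, 1],       # falls exactly as hours_studied rises
})

# Strong positive relationship: the result is close to 1
r_pos = toy["hours_studied"].corr(toy["exam_score"])

# Perfect negative linear relationship: the result is -1 (up to rounding)
r_neg = toy["hours_studied"].corr(toy["hours_of_tv"])

print(f"hours_studied vs exam_score:  {r_pos:.3f}")
print(f"hours_studied vs hours_of_tv: {r_neg:.3f}")
```

A value near 0 would mean the points show no straight-line pattern, although a curved relationship could still exist.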



## Task C: Create a Histogram for Temperature Data

In this task, you will create a histogram for the temperature data in the DataFrame. *Pay attention to the number of bins required for an appropriate display of the data.* You can also consider creating subplots with different values for this parameter if you find it suitable, but it is not mandatory.

<br>
Customize the graph according to your preferences by modifying the parameters as desired.

[Documentation for Subplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html)

[Documentation for Histograms](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)
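
To illustrate how the number of bins changes the picture, this sketch draws the same synthetic sample with three different `bins` values in side-by-side subplots. The sample, the seed, and the three bin counts are arbitrary assumptions chosen only for demonstration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample (not the temperature data)
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=20, scale=5, size=200)

# Candidate bin counts to compare
bin_options = [5, 15, 50]

# One row of subplots, one subplot per bin count
fig, axes = plt.subplots(1, len(bin_options), figsize=(15, 4))

for ax, n_bins in zip(axes, bin_options):
    ax.hist(values, bins=n_bins, color="steelblue", edgecolor="black")
    ax.set_title(f"bins = {n_bins}")
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency")
    ax.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()
```

Too few bins tend to hide the shape of the distribution, while too many make it look noisy.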

<br>

Feel free to use the structure below to organize your code, but it is not required. You can delete it and approach the task however you feel most comfortable.

In [None]:
# Extract temperature data

# Create a figure and set its size

# Create a histogram with custom settings

# Add a title and labels to the axes

# Customize the grid

# Show the plot


# Activity 2. Exploring Stock Market Data and Looking for Trends

Welcome to the 'Exploring Stock Market Data' activity. In this exercise, we dive into the world of financial data analysis. By examining historical stock prices, you'll uncover insights that can inform investment decisions. Let's explore the numbers and patterns behind stock performance together.

<br>

First, write code that loads the data from the CSV file into a pandas DataFrame.

Remember that each time you exit Colab you must upload the CSV file again.


```
df = pd.read_csv("File_name")
```



In [None]:
df2 = pd.read_csv('https://raw.githubusercontent.com/nicoleenos/IDA-Pilot-23/main/Activities%20Week%2011/Stocks.csv')

df2.head()

Unnamed: 0,Date,Symbol,Open,High,Low,Close,Volume
0,2022-01-03,AAPL,150.0,152.8,149.5,152.45,12345678
1,2022-01-04,AAPL,152.5,153.2,150.8,152.8,13456789
2,2022-01-05,AAPL,152.8,153.5,151.9,152.25,11234567
3,2022-01-06,AAPL,152.3,153.4,151.6,153.15,9876543
4,2022-01-07,AAPL,153.2,153.6,151.7,153.3,10987654


You can see that the dataframe has different columns, and each one represents the following:

* **Date:** It is the day for which the behavior of the stock was observed.

* **Symbol:** The ticker symbol of the stock: AAPL for Apple Inc. and GOOGL for Google.

* **Open:** It is the initial price of the stock on the day

* **High:** The maximum price of the day reached by the stock

* **Low:** The lowest price of the day reached by the stock

* **Close:** It is the price at the end of the day

* **Volume:** The total number of shares bought or sold during the day

It is recommended to sort your DataFrame by date in ascending order and then reset the index, so that row zero matches the first recorded date. To do this, you can use the following syntax:



```
df2 = df2.sort_values(by='Date')
df2 = df2.reset_index(drop=True)
```
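
The effect of the two calls is easier to see on a tiny example. The sketch below uses a made-up three-row table (the name `toy` and its values are hypothetical): sorting reorders the rows but keeps their old index labels, and `reset_index(drop=True)` renumbers them from zero.

```python
import pandas as pd

# Made-up toy data with the dates deliberately out of order
toy = pd.DataFrame({
    "Date": ["2022-01-05", "2022-01-03", "2022-01-04"],
    "Close": [152.25, 152.45, 152.80],
})

# Sorting moves the rows, but each row keeps its original index label
sorted_toy = toy.sort_values(by="Date")
print(sorted_toy.index.tolist())   # [1, 2, 0]

# drop=True discards the old labels instead of keeping them as a new column
sorted_toy = sorted_toy.reset_index(drop=True)
print(sorted_toy.index.tolist())   # [0, 1, 2]
print(sorted_toy)
```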



In [None]:
# Sort the DataFrame by the 'Date' column in ascending order

# Reset the index, starting from the oldest date

To get a better insight into this information, it is important to use graphs that allow you to observe data trends. To do so, please follow the steps below which indicate which graphs you should make and what you should keep in mind.

## Task A: Line plot

One of the fastest ways to analyze a trend in the data is a line graph over time, since it lets you observe how the data changes over a specific interval. It can also help you estimate whether a value will increase or decrease in the future, although such estimates are not always accurate.

<br>

To begin with, make a line chart of the stock volume for the month of February, using one line for each company:

* Use blue to represent Apple shares and red to represent Google shares.

* Be sure to style the graph and label the axes.


This should help:
```
february_data = df2[(df2['Date'] >= 'Start_Date') & (df2['Date'] <= 'End_Date')]
```
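
Here is a sketch of the same filtering pattern, followed by one line per group, on made-up data. The tickers `AAA` and `BBB`, the column names `Ticker` and `Units`, and the March date range are hypothetical. The example converts `Date` to datetime first, which is an extra step not shown in the hint above; pandas then accepts the date strings in the comparison.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up toy data: two hypothetical tickers observed on four dates
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-02-27", "2022-02-27", "2022-03-01", "2022-03-01",
        "2022-03-02", "2022-03-02", "2022-04-01", "2022-04-01",
    ]),
    "Ticker": ["AAA", "BBB"] * 4,
    "Units": [100, 80, 120, 90, 110, 95, 130, 70],
})

# Keep only the rows whose date falls inside the chosen range (inclusive)
march = toy[(toy["Date"] >= "2022-03-01") & (toy["Date"] <= "2022-03-31")]

# Split the filtered rows by ticker
aaa = march[march["Ticker"] == "AAA"]
bbb = march[march["Ticker"] == "BBB"]

# Draw one line per ticker on the same axes
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(aaa["Date"], aaa["Units"], color="green", marker="o", label="AAA")
ax.plot(bbb["Date"], bbb["Units"], color="orange", marker="o", label="BBB")

ax.set_title("Toy example: units per day in March")
ax.set_xlabel("Date")
ax.set_ylabel("Units")
ax.legend()
plt.xticks(rotation=45)

plt.show()
```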



In [None]:
# Filter data for the month of February

# Separate data for Apple and Google

# Create a line chart for Apple's stock volume (blue color)

# Create a line chart for Google's stock volume (red color)

# Customize the chart

# Show the chart

* Which stock do you think has lost the most value?
* Which stock do you think will continue to grow?
* Do you think there may be outliers in any of the curves? How would they affect the results?

## Task B - Data Distribution and Outliers

In this exercise, let's dive into the world of data distribution. We'll be focusing on the closing prices of a chosen company's stock (for example, Apple - AAPL) from our dataset.

<br>


Histograms and boxplots are essential in data analysis for visualizing data distribution, identifying outliers, and making informed modeling decisions. They provide concise summaries, aiding in effective data exploration and communication.

<br>


Your task is to create both a histogram and a boxplot to gain insights into how these closing prices are distributed.

To do so, use Matplotlib's **subplots** tool and place the two plots side by side.

<br>

[Documentation on Boxplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html)

[Documentation for Subplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html)
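
As a rough sketch of the side-by-side layout, the example below plots a synthetic sample (not the stock data) as a histogram and as a boxplot. The sample, the injected extreme value of 125, and the styling are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample with one deliberately extreme value appended
rng = np.random.default_rng(seed=1)
values = np.append(rng.normal(loc=100, scale=3, size=60), [125.0])

# One row, two columns: histogram on the left, boxplot on the right
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(12, 4))

ax_hist.hist(values, bins=12, color="skyblue", edgecolor="black")
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Frequency")

# boxplot returns a dict of the drawn artists (boxes, whiskers, fliers, ...)
box = ax_box.boxplot(values)
ax_box.set_title("Boxplot")
ax_box.set_ylabel("Value")

# Adjust spacing so titles and labels do not overlap
plt.tight_layout()
plt.show()
```

The extreme value appears as an isolated bar in the histogram and as a separate point beyond the whisker in the boxplot.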

In [None]:
# Filter data for Apple (AAPL)

# Create subplots with two columns

# Plot a histogram on the first subplot

# Plot a boxplot on the second subplot

# Adjust spacing between subplots
# plt.tight_layout()

# Show the subplots

* What information can you extract from these graphs?
* Do they make it easier to identify outliers?
* What do the boxplot lines represent?
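
To connect the boxplot to numbers, the sketch below computes the quantities that Matplotlib draws with its default settings (`whis=1.5`): the box spans the first to the third quartile, the line inside the box is the median, and each whisker reaches the most extreme data point that lies within 1.5 times the interquartile range (IQR) of the box. The array `values` is made up for illustration.

```python
import numpy as np

# Made-up sample with one value far from the rest
values = np.array([10, 12, 13, 14, 15, 16, 18, 40])

# Box edges and the line inside the box
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which points are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```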

## Task C - Relation Between Variables

Scatter plots are like maps for data. They help us see how two things are connected by placing dots on a graph. By looking at these dots, we can find patterns or important points in the data, which can be quite useful for making informed decisions.

<br>

To complete this task, create a scatter plot of Google's closing price against its trading volume. Remember to customize your chart and add axis labels and a title.

In [None]:
# Filter data for Google (GOOGL)

# Create a scatter plot

# Show the scatter plot

* Can you extract a trend from this data?

* Is there any way to relate closing price to volume?

# Activity 3 - Analyzing Books with Pandas

Pandas is one of the most widely used libraries in Python for data analysis due to its efficiency and the large number of functions it has to perform operations with datasets.

<br>

The following activity accompanies the Treehouse module Analyzing Books with Pandas. The tasks in this section consist of writing snippets to extract specific information from the dataset.

[Treehouse Module](https://teamtreehouse.com/library/analyzing-books-with-pandas)

<br>

First, write code that loads the data from the CSV file into a pandas DataFrame.

Remember that each time you exit Colab you must upload the CSV file again.


In [None]:
books = pd.read_csv("books.csv")

books.head()

FileNotFoundError: [Errno 2] No such file or directory: 'books.csv'

The tasks to be carried out are:

A. Find the most popular book (the book with the highest number of reviews) by the author "J.K. Rowling"

In [None]:
# Filter the dataset for books by the specified author

# Sort the filtered dataset by ratings_count in descending order

# Get the most popular book (top entry)

B. Find the top 5 highest rated books between the years 2000 and 2010


Remember to first convert the 'publication_date' column to a pandas datetime type, and take a look at `pd.Timestamp`:
```
books['publication_date'] = pd.to_datetime(books['publication_date'])

pd.Timestamp()
```

[Treehouse link](https://teamtreehouse.com/library/analyzing-books-with-pandas/popularity)

[Timestamp documentation](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html)
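
The following sketch shows one way `pd.Timestamp` can be used to filter a date range and then rank the result. It runs on a made-up table, so the DataFrame `toy` and its columns (`name`, `released`, `score`) are hypothetical and do not match the columns in books.csv.

```python
import pandas as pd

# Made-up toy data (hypothetical columns, not from books.csv)
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
    "released": ["1999-03-15", "2005-07-01", "2010-12-31", "2011-01-01", "2003-09-09"],
    "score": [4.8, 4.1, 4.5, 4.9, 3.7],
})

# Convert the text column to datetime so it can be compared with Timestamps
toy["released"] = pd.to_datetime(toy["released"])

# Boundaries of the period of interest (both ends included)
start = pd.Timestamp("2000-01-01")
end = pd.Timestamp("2010-12-31")

# Keep only the rows released inside the period
in_range = toy[(toy["released"] >= start) & (toy["released"] <= end)]

# Rank by score, highest first, and keep the top two
top = in_range.sort_values(by="score", ascending=False).head(2)
print(top)
```

"Alpha" and "Delta" have the highest scores overall but fall outside the period, so they are excluded before the ranking.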

In [None]:
# Convert the "publication_date" column to datetime

# Filter the dataset for books published between 2000 and 2010

# Sort the filtered dataset by average_rating in descending order

# Retrieve the top 5 highest-rated books

C. Find the book that has the most pages.

In [None]:
# Your code here

D. Find all books published in 2002 before October 31.

In [None]:
# Your code here

E. Find the average number of pages per book written by the author 'J.K. Rowling'

Hint: [You can check this Treehouse video if you need some help](https://teamtreehouse.com/library/analyzing-books-with-pandas/pages)


In [None]:
# Your code here