# Table of Contents

1. [Objective](#objective)
2. [Hypothesis](#hypothesis)
3. [Tools and Libraries](#tools_and_libraries)
    - [Pandas](#1.1_importing_libraries)
    - [Numpy](#1.1_importing_libraries)
    - [Matplotlib](#1.1_importing_libraries)
4. [Project Structure](#project_structure)
    1. [Sourcing and Loading](#1_sourcing_and_loading)
        - [Documentation for Libraries](#documentation_for_libraries)
        - [Importing Libraries](#1.1_importing_libraries)
        - [Loading the Data](#1.2_loading_the_data)
    2. [Cleaning, Transforming, and Visualizing](#2._Cleaning_transforming_and_visualizing)
        - [Exploring the Data](#2.1_exploring_the_data)
        - [Check Missing Values](#2.2_check_missing_values)
        - [Convert Date Column to Datetime](#2.3_convert_date_column_to_datetime)
        - [Make a Copy of the Data](#2.4_make_a_copy_of_the_data)
        - [Filter Relevant Columns](#2.5_filter_relevant_columns)
        - [Resample Data](#2.6_resample_data)
        - [Calculate Daily Returns](#2.7_calculate_daily_returns)
        - [Convert Average Volume to Millions](#2.8_convert_average_volume_to_millions)
    3. [Exploratory Data Analysis (EDA)](#3_exploratory_data_analysis_eda)
        - [Visualize Time Series Data](#3.1_visualize_time_series_data)
        - [Explore Daily Returns](#3.2_explore_daily_returns)
    4. [Further Analysis](#4_further_analysis)
        - [Statistical Analysis](#4.1_statistical_analysis)
        - [Correlation Analysis](#4.2_correlation_analysis)
        - [Visualization of Correlations](#4.3_visualization_of_correlations)



# Title 

## Objectives 

***Hypothesis?***


The following tools and concepts will be used:
- **pandas**
    - **data ingestion and inspection** 
    - **exploratory data analysis** 
    - **tidying and cleaning** 
    - **transforming DataFrames** 
    - **subsetting DataFrames with lists** 
    - **filtering DataFrames** 
    - **grouping data** 
    - **melting data** 
    - **advanced indexing** 
- **matplotlib** 
- **fundamental data types** 
- **dictionaries** 
- **handling dates and times** 
- **function definition** 
- **default arguments, variable length, and scope** 
- **lambda functions and error handling** 

### 1. Sourcing and Loading 

The documentation for libraries can be found here:
* [Pandas](https://pandas.pydata.org/)
* [Numpy](http://www.numpy.org/) 
* [Matplotlib](https://matplotlib.org/) 


#### 1.1 Importing Libraries<a id='1.1_importing_libraries'></a>

In [3]:
# Import pandas and numpy as pd and np, respectively.
import pandas as pd
import numpy as np

# Load the pyplot collection of functions from matplotlib, as plt 
import matplotlib.pyplot as plt 

#### 1.2 Loading the Data<a id='1.2_loading_the_data'></a>

The data comes from [Kaggle](https://www.kaggle.com/datasets/abhimaneukj/tesla-inc-tsla-dataset?resource=download), a free data-sharing portal with a wide range of datasets.

In [4]:
# Path to the CSV file, relative to the working directory
file_path = 'TSLA.csv'

# Load the CSV file into a pandas DataFrame
df = pd.read_csv(file_path)

**Commentary:** Importing libraries sets the foundation for data manipulation and visualization. Loading the data is the initial step, allowing us to work with the raw dataset in subsequent sections.
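
As a minimal sketch of an alternative loading step, the date column can be parsed while the file is read rather than converted afterwards. The inline sample reuses the first rows that `df.head()` displays in this notebook so the snippet runs without the CSV; the names `sample_csv` and `df_sample` are hypothetical and not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical inline sample: the first rows of TSLA.csv as shown in this notebook
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2010-07-01,5.0,5.184,4.054,4.392,4.392,41094000\n"
    "2010-07-02,4.6,4.62,3.742,3.84,3.84,25699000\n"
    "2010-07-06,4.0,4.0,3.166,3.222,3.222,34334500\n"
)

# parse_dates converts 'Date' to datetime64; index_col makes it the row index
df_sample = pd.read_csv(sample_csv, parse_dates=["Date"], index_col="Date")

print(df_sample.index.dtype)
print(df_sample.shape)
```

With the real file, the same two arguments could be passed to `pd.read_csv(file_path, ...)`.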

### 2. Cleaning, transforming, and visualizing<a id='2._Cleaning_transforming_and_visualizing'></a>

#### 2.1. Exploring the data<a id='2.1_exploring_the_data'></a>

In [8]:
# Display the first few rows of the DataFrame
df.head()

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2010-07-01 | 5.0 | 5.184 | 4.054 | 4.392 | 4.392 | 41094000 |
| 1 | 2010-07-02 | 4.6 | 4.62 | 3.742 | 3.84 | 3.84 | 25699000 |
| 2 | 2010-07-06 | 4.0 | 4.0 | 3.166 | 3.222 | 3.222 | 34334500 |
| 3 | 2010-07-07 | 3.28 | 3.326 | 2.996 | 3.16 | 3.16 | 34608500 |
| 4 | 2010-07-08 | 3.228 | 3.504 | 3.114 | 3.492 | 3.492 | 38557000 |


In [9]:
# Check the data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2843 entries, 0 to 2842
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       2843 non-null   object 
 1   Open       2843 non-null   float64
 2   High       2843 non-null   float64
 3   Low        2843 non-null   float64
 4   Close      2843 non-null   float64
 5   Adj Close  2843 non-null   float64
 6   Volume     2843 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 155.6+ KB


In [10]:
# Display basic summary statistics
df.describe()

| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| count | 2843.0 | 2843.0 | 2843.0 | 2843.0 | 2843.0 | 2843.0 |
| mean | 105.868475 | 108.03137 | 103.555733 | 105.924597 | 105.924597 | 31415240.0 |
| std | 188.738974 | 192.483055 | 184.638617 | 188.836358 | 188.836358 | 28418800.0 |
| min | 3.228 | 3.326 | 2.996 | 3.16 | 3.16 | 592500.0 |
| 25% | 10.698 | 11.026 | 10.42 | 10.727 | 10.727 | 12510500.0 |
| 50% | 45.874001 | 46.493999 | 45.102001 | 45.916 | 45.916 | 24815000.0 |
| 75% | 65.021 | 66.251999 | 64.015001 | 65.275002 | 65.275002 | 40120250.0 |
| max | 891.380005 | 900.400024 | 871.599976 | 883.090027 | 883.090027 | 304694000.0 |


In [11]:
# Check out the data types:
df.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object

#### 2.2 Check Missing Values<a id='2.2_check_missing_values'></a>


In [12]:
print(df.isnull().sum())

Date         0
Open         0
High         0
Low          0
Close        0
Adj Close    0
Volume       0
dtype: int64


#### 2.3 Convert Date Column to Datetime<a id='2.3_convert_date_column_to_datetime'></a>


In [13]:
df['Date'] = pd.to_datetime(df['Date'])

# Create Time-Based Indices:
df.set_index('Date', inplace=True)



In [14]:
df.head()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2010-07-01 | 5.0 | 5.184 | 4.054 | 4.392 | 4.392 | 41094000 |
| 2010-07-02 | 4.6 | 4.62 | 3.742 | 3.84 | 3.84 | 25699000 |
| 2010-07-06 | 4.0 | 4.0 | 3.166 | 3.222 | 3.222 | 34334500 |
| 2010-07-07 | 3.28 | 3.326 | 2.996 | 3.16 | 3.16 | 34608500 |
| 2010-07-08 | 3.228 | 3.504 | 3.114 | 3.492 | 3.492 | 38557000 |
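
A datetime index makes date-based selection straightforward. The sketch below illustrates this on a hypothetical `sample` frame built from the five rows displayed above; it is an illustration only, and the variable names are assumptions rather than part of the notebook.

```python
import pandas as pd

# Hypothetical sample built from the rows displayed above
sample = pd.DataFrame(
    {"Adj Close": [4.392, 3.84, 3.222, 3.16, 3.492]},
    index=pd.to_datetime(
        ["2010-07-01", "2010-07-02", "2010-07-06", "2010-07-07", "2010-07-08"]
    ),
)

# Label-based slice on the DatetimeIndex: all rows from 6 July 2010 onward
late_week = sample.loc["2010-07-06":]

# Boolean mask on a datetime attribute: rows that fall on a Tuesday (dayofweek == 1)
tuesdays = sample[sample.index.dayofweek == 1]

print(late_week)
print(tuesdays)
```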


#### 2.4 Make a copy of the data<a id='2.4_make_a_copy_of_the_data'></a>

In [15]:
# Create a copy of the DataFrame
df_copy = df.copy()

#### 2.5 Filter Relevant Columns<a id='2.5_filter_relevant_columns'></a>
Select the columns that are relevant for your analysis.

In [16]:
selected_columns = ['Volume']
df1 = df_copy[selected_columns]
df1.describe()

| | Volume |
|---|---|
| count | 2843.0 |
| mean | 31415240.0 |
| std | 28418800.0 |
| min | 592500.0 |
| 25% | 12510500.0 |
| 50% | 24815000.0 |
| 75% | 40120250.0 |
| max | 304694000.0 |


#### 2.6 Resample Data<a id='2.6_resample_data'></a>
For time series, consider resampling the data to a different frequency (e.g., weekly or monthly).


In [18]:
# Resample to monthly frequency and average each column
# ('M' is the month-end alias; pandas 2.2+ prefers 'ME')
df_resampled = df.resample('M').mean()

# Prefix every column name with 'Average ' (e.g. 'Open' -> 'Average Open')
df_resampled = df_resampled.add_prefix('Average ')

# Rename the index
df_resampled = df_resampled.rename_axis('Month')

df_resampled.head()

| Month | Average Open | Average High | Average Low | Average Close | Average Adj Close | Average Volume |
|---|---|---|---|---|---|---|
| 2010-07-31 | 4.014667 | 4.128 | 3.763238 | 3.911619 | 3.911619 | 15375190.0 |
| 2010-08-31 | 3.909091 | 3.982 | 3.816091 | 3.902182 | 3.902182 | 3417773.0 |
| 2010-09-30 | 4.15581 | 4.255238 | 4.06181 | 4.148095 | 4.148095 | 4296643.0 |
| 2010-10-31 | 4.144667 | 4.198571 | 4.085143 | 4.142667 | 4.142667 | 1559000.0 |
| 2010-11-30 | 5.717429 | 5.974 | 5.545714 | 5.808381 | 5.808381 | 6741690.0 |
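
Averaging every column is one way to resample. As a hedged alternative that this notebook does not use, each column can be given its own aggregation, for example the first open, highest high, lowest low, last close, and total volume of each week. The `sample` frame below is hypothetical and reuses the first five daily rows shown earlier.

```python
import pandas as pd

# Hypothetical sample: the first five daily rows shown earlier in the notebook
sample = pd.DataFrame(
    {
        "Open": [5.0, 4.6, 4.0, 3.28, 3.228],
        "High": [5.184, 4.62, 4.0, 3.326, 3.504],
        "Low": [4.054, 3.742, 3.166, 2.996, 3.114],
        "Close": [4.392, 3.84, 3.222, 3.16, 3.492],
        "Volume": [41094000, 25699000, 34334500, 34608500, 38557000],
    },
    index=pd.to_datetime(
        ["2010-07-01", "2010-07-02", "2010-07-06", "2010-07-07", "2010-07-08"]
    ),
)

# 'W' groups rows into weeks ending on Sunday; agg applies one function per column
weekly = sample.resample("W").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)

print(weekly)
```

The five sample days span two calendar weeks, so `weekly` contains two rows.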


In [19]:
# Express the average volume in millions
df_resampled['Average Volume (Millions)'] = df_resampled['Average Volume'] / 1_000_000

# Drop the original 'Average Volume' column, which is now redundant
df_resampled = df_resampled.drop(columns=['Average Volume'])

df_resampled.head()





| Month | Average Open | Average High | Average Low | Average Close | Average Adj Close | Average Volume (Millions) |
|---|---|---|---|---|---|---|
| 2010-07-31 | 4.014667 | 4.128 | 3.763238 | 3.911619 | 3.911619 | 15.37519 |
| 2010-08-31 | 3.909091 | 3.982 | 3.816091 | 3.902182 | 3.902182 | 3.417773 |
| 2010-09-30 | 4.15581 | 4.255238 | 4.06181 | 4.148095 | 4.148095 | 4.296643 |
| 2010-10-31 | 4.144667 | 4.198571 | 4.085143 | 4.142667 | 4.142667 | 1.559 |
| 2010-11-30 | 5.717429 | 5.974 | 5.545714 | 5.808381 | 5.808381 | 6.74169 |


#### 2.7 Calculate Daily Returns<a id='2.7_calculate_daily_returns'></a>
Derive additional features or transformations that might be useful, such as daily returns.

In [20]:
df_copy['daily_return'] = df_copy['Adj Close'].pct_change()
df_copy.head()

| Date | Open | High | Low | Close | Adj Close | Volume | daily_return |
|---|---|---|---|---|---|---|---|
| 2010-07-01 | 5.0 | 5.184 | 4.054 | 4.392 | 4.392 | 41094000 | NaN |
| 2010-07-02 | 4.6 | 4.62 | 3.742 | 3.84 | 3.84 | 25699000 | -0.125683 |
| 2010-07-06 | 4.0 | 4.0 | 3.166 | 3.222 | 3.222 | 34334500 | -0.160937 |
| 2010-07-07 | 3.28 | 3.326 | 2.996 | 3.16 | 3.16 | 34608500 | -0.019243 |
| 2010-07-08 | 3.228 | 3.504 | 3.114 | 3.492 | 3.492 | 38557000 | 0.105063 |
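
To make the calculation explicit: `pct_change()` computes each day's return as the current price divided by the previous price, minus one. The sketch below checks this against a manual `shift`-based version using the five `Adj Close` values shown above; `adj_close`, `via_pct_change`, and `via_shift` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical sample: the first five 'Adj Close' values shown above
adj_close = pd.Series(
    [4.392, 3.84, 3.222, 3.16, 3.492],
    index=pd.to_datetime(
        ["2010-07-01", "2010-07-02", "2010-07-06", "2010-07-07", "2010-07-08"]
    ),
    name="Adj Close",
)

# Built-in percentage change between consecutive rows
via_pct_change = adj_close.pct_change()

# Equivalent manual formula: r_t = P_t / P_(t-1) - 1
via_shift = adj_close / adj_close.shift(1) - 1

print(via_pct_change.round(6))
```

The first value is `NaN` in both versions because the first row has no previous price to compare against.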


#### 2.8 Delta<a id='2.8_Delta'></a>
.....

**Commentary:** Cleaning and transforming the data are crucial steps for ensuring its quality and suitability for analysis. Visualization aids in uncovering patterns and trends, guiding us toward more focused exploration.

In [15]:
# Group by year and calculate summary statistics
summary_statistics_by_year = df.groupby(df.index.year).agg({
    'Open': ['mean', 'std', 'min', 'max'],
    'Close': ['mean', 'std', 'min', 'max'],
    'Volume': ['mean', 'std', 'min', 'max']
})

# Rename the columns for clarity
summary_statistics_by_year.columns = [f'{col[0]}_{col[1]}' for col in summary_statistics_by_year.columns]

# Display the summary statistics by year
print("Summary Statistics by Year:")
print(summary_statistics_by_year)

Summary Statistics by Year:
       Open_mean    Open_std    Open_min    Open_max  Close_mean   Close_std  \
Date                                                                           
2010    4.684766    1.019008    3.228000    7.174000    4.666750    1.024196   
2011    5.364397    0.556383    4.356000    6.926000    5.360952    0.570726   
2012    6.240624    0.523362    5.324000    7.638000    6.233720    0.545828   
2013   20.883286   10.615145    6.616000   38.792000   20.880246   10.605334   
2014   44.683079    5.957028   28.100000   57.534000   44.665817    5.859630   
2015   45.966389    4.750339   37.166000   56.040001   46.008580    4.754049   
2016   42.011690    4.280572   28.464001   53.290001   41.953452    4.273895   
2017   62.859243    8.342982   42.950001   77.337997   62.863259    8.193263   
2018   63.436693    5.751053   50.556000   75.000000   63.461984    5.752044   
2019   54.605627   10.469095   36.220001   87.000000   54.706040   10.606053   
2020  289.10

### 3. Exploratory Data Analysis (EDA)<a id='3_exploratory_data_analysis_eda'></a>

#### 3.1 Visualize Time Series Data<a id='3.1_visualize_time_series_data'></a>

* **Line Plots:** Plotting the time series data to visualize the trend in key variables (e.g., stock prices over time).
* **Moving Averages:** Calculating and plotting moving averages to smooth out fluctuations.
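
The two ideas above can be sketched as follows. This is a minimal illustration on a hypothetical five-row sample taken from the rows shown earlier, with an arbitrarily chosen 3-day window; on the full dataset a longer window would likely be more informative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: the first five 'Adj Close' values from the dataset
adj_close = pd.Series(
    [4.392, 3.84, 3.222, 3.16, 3.492],
    index=pd.to_datetime(
        ["2010-07-01", "2010-07-02", "2010-07-06", "2010-07-07", "2010-07-08"]
    ),
    name="Adj Close",
)

# 3-day simple moving average; the first two values are NaN (incomplete window)
rolling_mean = adj_close.rolling(window=3).mean()

# Line plot of the raw series with the smoothed series on the same axes
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(adj_close.index, adj_close, label="Adj Close")
ax.plot(rolling_mean.index, rolling_mean, label="3-day moving average")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.legend()
plt.show()
```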


#### 3.2 Explore Daily Returns<a id='3.2_explore_daily_returns'></a>

* **Histograms:** Visualizing the distribution of daily returns.
* **Descriptive Statistics:** Calculating and visualizing summary statistics for daily returns.
* **Cumulative Returns:** Plotting the cumulative returns over time.
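
A minimal sketch of the descriptive statistics and cumulative returns listed above, again on a hypothetical five-row sample. Cumulative return is computed here by compounding the daily returns, which is one common convention and an assumption of this sketch.

```python
import pandas as pd

# Hypothetical sample: the first five 'Adj Close' values from the dataset
adj_close = pd.Series(
    [4.392, 3.84, 3.222, 3.16, 3.492],
    index=pd.to_datetime(
        ["2010-07-01", "2010-07-02", "2010-07-06", "2010-07-07", "2010-07-08"]
    ),
    name="Adj Close",
)

# Daily returns, dropping the leading NaN
daily_return = adj_close.pct_change().dropna()

# Summary statistics (count, mean, std, min, quartiles, max)
stats = daily_return.describe()

# Compound the daily returns: (1 + r_1)(1 + r_2)...(1 + r_t) - 1
cumulative_return = (1 + daily_return).cumprod() - 1

print(stats)
print(cumulative_return)
```

The last cumulative value equals the last price divided by the first price, minus one, which is a quick sanity check. For the histogram, `daily_return.plot.hist()` could be applied to the same series.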