# Exploring Data with Python & GitHub
Let us have an interactive session as we go through a sample of the project. Please feel free to ask any questions about the project as we wind down the Python sprint.
## Objectives
*    Have a walk-through of the basic structure of the Python project.
*    Identify some challenges faced during the project and how to overcome them.

## Imports
Import all the libraries required for your analysis project.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

## Load data

We will look at multiple options to import the data using pandas:
*    Local CSV
*    GitHub raw file

### Local CSV
Pass the location of the file relative to your notebook.

In [None]:
df = pd.read_csv('data/data.csv')

### Load directly from a GitHub raw CSV
Navigate to GitHub and find the URL for the raw file in your repository.
*    Lets you read the latest version of the data without pulling the entire repository.
*    Requires an internet connection.

In [None]:
# url = ''
# df = pd.read_csv(url)
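
A raw file URL differs from the URL of the file's normal page on GitHub. The sketch below shows one way to derive it. The helper `to_raw_url` and the owner, repository and file names are placeholders invented for illustration, not part of the project. It assumes the usual `github.com/<owner>/<repo>/blob/<branch>/<file>` page layout.

```python
from urllib.parse import urlparse


def to_raw_url(blob_url: str) -> str:
    """Turn a github.com 'blob' page URL into its raw.githubusercontent.com form."""
    parts = urlparse(blob_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    # The path is expected to look like /<owner>/<repo>/blob/<branch>/<path to file>
    segments = parts.path.strip("/").split("/")
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError("expected github.com/<owner>/<repo>/blob/<branch>/<file>")
    owner, repo, _, *rest = segments  # drop the 'blob' segment
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


# Placeholder names only; substitute your own repository details.
example = "https://github.com/your-user/your-repo/blob/main/data/data.csv"
raw_url = to_raw_url(example)
print(raw_url)
```

The resulting string can be passed to `pd.read_csv` in the same way as a local path. Clicking the "Raw" button on the file's GitHub page gives the same URL without any code.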

## Preview the data

In [None]:
df

In [None]:
df.head(20)

## Basic information and cleaning

### Check all data types and non-null counts for each column

In [None]:
df.info()

In [None]:
len(df)

In [None]:
df.columns
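
Alongside `df.info()`, it can help to count missing values and duplicate rows explicitly before cleaning. This is a minimal sketch on toy data. The `sample` frame and its values are made up for illustration, and the column names are assumed to match the project's `date` and `close` columns.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the project dataset (values are invented).
sample = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-04"],
    "close": [101.5, 102.0, 102.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```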

### Display summary statistics for numeric columns

In [None]:
df.describe()

### Convert *date* column to datetime if it isn't automatically picked up by pandas

In [None]:
df['date'] = pd.to_datetime(df['date'])
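
If some date strings are malformed, the conversion above raises an error. One possible way to handle this, sketched on invented toy values, is to pass `errors="coerce"` so that bad entries become `NaT` and can be inspected. The `%Y-%m-%d` format is an assumption about how the dates are written.

```python
import pandas as pd

# Toy values; the second entry is deliberately malformed.
raw_dates = pd.Series(["2024-01-02", "not a date", "2024-01-04"])

# errors="coerce" converts unparseable entries to NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("unparseable rows:", int(parsed.isna().sum()))
```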

In [None]:
df.info()

In [None]:
df.columns

### Sort the data chronologically

In [None]:
df = df.sort_values('date')

### Set the date as the index for further analysis

In [None]:
df.set_index('date', inplace=True)
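
One benefit of a sorted datetime index is that rows can be selected by date strings. The sketch below uses a small invented frame named `toy`, not the project data, to check the ordering and select a single month.

```python
import pandas as pd

# Toy data, deliberately out of chronological order.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-02-01", "2024-01-03", "2024-01-02"]),
    "close": [105.0, 102.0, 101.5],
})

# Same two steps as above: sort chronologically, then index by date.
toy = toy.sort_values("date").set_index("date")

# Confirm the index is in chronological order.
print(toy.index.is_monotonic_increasing)

# Partial string indexing: select every row in January 2024.
january = toy.loc["2024-01"]
print(january)
```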

## Upload to GitHub

Let's push our changes to GitHub so that another member of the team can proceed with the visualisation and analysis.

## Visualisation and Analysis

### You can start your analysis with visuals of the columns you are interested in

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(df['close'], label='Close Price')
plt.title('Stock Close Price Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

#### [commentary]

### Create a new column from existing data for analysis

Let us calculate the percentage change in the close price from one recorded period to the next.

In [None]:
df['return'] = df['close'].pct_change()

In [None]:
df.head(5)

In [None]:
# Drop rows with missing values, including the first row, whose return is NaN
df = df.dropna()
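
To make the two steps above concrete, here is a small worked sketch on invented prices. `pct_change()` computes `(current - previous) / previous`, so the first row has no previous value and is `NaN`. The `volume` column is hypothetical and is there only to show that `dropna()` without arguments removes rows with a missing value in any column, while `subset=["return"]` looks at the return column alone.

```python
import pandas as pd

# Toy prices; 'volume' is a hypothetical extra column with one gap.
toy = pd.DataFrame({
    "close": [100.0, 110.0, 99.0],
    "volume": [10.0, None, 12.0],
})

# Fractional change from the previous row: NaN, about +0.10, about -0.10.
toy["return"] = toy["close"].pct_change()
print(toy)

# Drops any row with a NaN in any column.
dropped_any = toy.dropna()

# Drops only rows where 'return' itself is NaN.
dropped_return_only = toy.dropna(subset=["return"])

print(len(dropped_any), len(dropped_return_only))
```

Which variant is appropriate depends on whether gaps in other columns matter for the analysis.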

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(df['return'], label='Return')
plt.title('Stock Returns Over Time')
plt.xlabel('Date')
plt.ylabel('Percentage change')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

#### [commentary]

In [None]:
df['return'].plot(kind='hist', bins=100, figsize=(8, 4), title='Returns Distribution')
plt.xlabel('Daily Return')
plt.tight_layout()
plt.show()

#### [commentary]
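
A histogram gives a visual impression of symmetry. As a possible numeric complement, sample skewness can be computed with `Series.skew()`. The values below are invented to show how the measure behaves and are not results from the project data.

```python
import pandas as pd

# Invented values: one symmetric sample, one with a long right tail.
symmetric = pd.Series([-0.02, -0.01, 0.0, 0.01, 0.02])
right_tailed = pd.Series([-0.01, -0.01, 0.0, 0.0, 0.05])

# Skewness near zero suggests symmetry; positive values indicate a right tail.
symmetric_skew = symmetric.skew()
right_tailed_skew = right_tailed.skew()

print(symmetric_skew)
print(right_tailed_skew)
```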

## Hypothesis Testing [Optional]

**Question: *Are the returns symmetrical for the first and second halves of the year?***

*    $H_{0}$ :
*    $H_{1}$ : 

### Split the data

In [None]:
total_rows = len(df)
midpoint = total_rows // 2
returns_first_half = df['return'].iloc[:midpoint]
returns_second_half = df['return'].iloc[midpoint:]

In [None]:
returns_first_half
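
The split above divides the rows at the midpoint, which matches calendar halves only if the data covers exactly one year with evenly spread observations. If the question is meant in calendar terms, one alternative is to split on the month of the datetime index. This is a sketch on synthetic returns generated purely for illustration, and the names `toy_returns`, `h1` and `h2` are made up.

```python
import numpy as np
import pandas as pd

# Synthetic returns on business days of one calendar year (not real data).
idx = pd.bdate_range("2023-01-02", "2023-12-29")
rng = np.random.default_rng(0)
toy_returns = pd.Series(rng.normal(0.0, 0.01, len(idx)), index=idx, name="return")

# January to June is the first half; July to December is the second half.
first_half_mask = toy_returns.index.month <= 6
h1 = toy_returns[first_half_mask]
h2 = toy_returns[~first_half_mask]

print(len(h1), len(h2))
```

With data spanning several years, this groups all January to June observations together regardless of year, which may or may not be what the question intends.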

### Perform the t-test

In [None]:
stat, p = stats.ttest_ind(returns_first_half, returns_second_half, equal_var=False)

In [None]:
stat
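
Passing `equal_var=False` makes `ttest_ind` run Welch's t-test, which compares the two sample means without assuming equal variances. The sketch below reproduces the statistic by hand on synthetic samples to show what is being computed. The arrays `a` and `b` are randomly generated for illustration and say nothing about the project data.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different spreads and sizes (not real returns).
rng = np.random.default_rng(42)
a = rng.normal(0.0, 0.01, 120)
b = rng.normal(0.0, 0.02, 130)

# Welch's t-test as used in the cell above.
welch_stat, welch_p = stats.ttest_ind(a, b, equal_var=False)

# The same statistic by hand: difference in means over its standard error,
# using the sample variance (ddof=1) of each group separately.
standard_error = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
manual_stat = (a.mean() - b.mean()) / standard_error

print(np.isclose(welch_stat, manual_stat))
```

Note that this test compares means. A difference in spread or shape between the two halves would not necessarily show up in the p-value.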

### Results

In [None]:
print(f"T-test statistic: {stat:.4f}, p-value: {p:.4f}")
if p < 0.05:
    print("Statistically significant difference in returns between halves.")
else:
    print("No statistically significant difference in returns between halves.")


## Conclusion and recommendation

### Conclusion
*    
*    
*    
### Recommendations [Optional]
*    