<div style="text-align: center;">
    <div style="font-size: 28px;"><strong>Stock Market Analysis and Prediction Project</strong></div>
    <div style="font-size: 15px;">Authors: Daniel A, Daniel B, David</div>
</div>

<div style="text-align: center;">
    <img src="Bullimage.jpg">
</div>


**<span style="font-size: 1.6em">Table of Contents</span>** 

1. [Introduction](#introduction)
2. [Overview](#overview)

3. [Data Curation](#data-curation)
   - [Datasets](#datasets)
   - [Why We Are Choosing These Datasets](#why-we-are-choosing-these-datasets)

4. [Exploratory Data Analysis](#exploratory-data-analysis)
   - [Data Preprocessing](#data-preprocessing)
   - [Importing and Parsing Data](#importing-and-parsing-data)
   - [Organizing Data](#organizing-data)
   - [Basic Data Exploration and Summary Statistics](#basic-data-exploration-and-summary-statistics)
   - [Main Characteristics](#main-characteristics)
   - [Identifying Key Attributes](#identifying-key-attributes)

5. [Hypothesis Testing and Statistical Analysis](#hypothesis-testing-and-statistical-analysis)
   - [Chi-Squared Test](#chi-squared-test)
   - [Z Test](#z-test)
   - [T Test](#t-test)
   - [Mann-Whitney U Test](#mann-whitney-u-test)
   - [ANOVA](#anova)
   - [Presenting Conclusions and Visualizations](#presenting-conclusions-and-visualizations)
     - [Conclusion 1: Correlation Analysis](#conclusion-1-correlation-analysis)
     - [Conclusion 2: Outlier Detection](#conclusion-2-outlier-detection)
     - [Conclusion 3: Hypothesis Testing](#conclusion-3-hypothesis-testing)

6. [Predictive Modeling](#predictive-modeling)
   - [Model Selection](#model-selection)
   - [Model Implementation](#model-implementation)
   - [Model Evaluation](#model-evaluation)

7. [Visualization](#visualization)
   - [Explanation of Results](#explanation-of-results)
   - [Plots and Graphs](#plots-and-graphs)

8. [Insights and Conclusions](#insights-and-conclusions)
   - [Summary of Findings](#summary-of-findings)
   - [Implications and Recommendations](#implications-and-recommendations)



<div style="text-align: center;">
    <span style="font-size: 1.6em; font-weight: bold">Data Collection</span>
</div>


The stock market is a dynamic and complex system that plays a crucial role in the global economy. Understanding stock market trends and predicting future stock prices are essential for investors, policymakers, and financial analysts. This project aims to analyze historical stock market data, identify significant trends, and develop predictive models to forecast future stock prices.

By examining data from various sectors, including major financial services and banks, top tech companies, and key stock market indices, we aim to gain a deeper understanding of how different factors influence stock prices. Our analysis will cover periods of economic stability as well as major financial events, such as the 2008 financial crisis and the COVID-19 pandemic. This comprehensive analysis will help us uncover patterns, correlations, and anomalies that can inform investment strategies and economic policies.

The main objectives of this project are:

- **Behavioral Analysis:** Track and analyze stock price behavior over time to identify trends and patterns.
- **Event Impact Analysis:** Examine the effects of significant events on stock prices, including financial crises, political changes, and economic policies.
- **Predictive Modeling:** Develop models to predict future stock prices using historical data and advanced machine learning techniques.

***Data Curation***

**FSCs-Banks (2006-2020):**

- Source: Kaggle
- URL: https://www.kaggle.com/datasets/dgawlik/nyse
- Contains stock information (high, low, close, etc.) for major financial services and banks from January 1, 2006, to November 1, 2020.

**Stock Market Indices (2006-2020):**

- Source: Kaggle
- URL: https://www.kaggle.com/datasets/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
- Provides data on the top five stock market indices from 2006 to 2020.

**Stock Market Indices (2014-2024):**

- Source: Kaggle
- URL: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset
- Extends the coverage of stock market indices data from 2014 to 2024.

**Tech Companies (2006-2020):**

- Source: Kaggle
- URL: https://www.kaggle.com/datasets/szrlee/stock-time-series-20050101-to-20171231
- Includes stock information for the top tech companies from January 1, 2006, to November 1, 2020.


***Why We Are Choosing These Datasets***

- **Historical Coverage:** These datasets provide extensive historical data from 2006 to 2024, covering key periods such as the 2008 financial crisis and the COVID-19 pandemic.
- **Diverse Sectors:** By including financial services, banks, tech companies, and major market indices, we can perform a comprehensive analysis across different sectors.
- **Event Impact Analysis:** These datasets allow us to examine the effects of significant events (e.g., financial crises, political changes) on stock prices, offering valuable insights into market behavior.
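
As a rough illustration of what event impact analysis on a price series can look like, the sketch below summarizes a date window by total return and maximum drawdown. The prices are synthetic, the `window_summary` helper is hypothetical, and the window dates are illustrative choices rather than official event boundaries.

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices (not project data), weekdays only.
dates = pd.bdate_range("2020-01-01", "2020-06-30")
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, len(dates)).cumsum(), index=dates, name="Close")


def window_summary(series, start, end):
    """Summarize price behavior between two dates, inclusive (hypothetical helper)."""
    window = series.loc[start:end]
    running_peak = window.cummax()
    return {
        "total_return": window.iloc[-1] / window.iloc[0] - 1,
        "max_drawdown": (window / running_peak - 1).min(),
        "trading_days": len(window),
    }


# Example window; the dates are illustrative, not official event boundaries.
summary = window_summary(close, "2020-02-19", "2020-03-23")
print(summary)
```

With real data, the same function could be applied to each symbol's closing-price series and to several windows, so that sectors can be compared over the same period.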

**<span style="font-size: 1.6em">Data Manipulation and Analysis</span>**

- **pandas**: For data manipulation and analysis.
- **numpy**: For numerical operations.

**<span style="font-size: 1.6em">Data Visualization</span>**

- **matplotlib**: For creating static, interactive, and animated visualizations.
- **seaborn**: For statistical data visualization, built on top of matplotlib.
- **plotly**: For interactive visualizations.

**<span style="font-size: 1.6em">Statistical Analysis</span>**

- **scipy**: For statistical tests and scientific computing.
- **statsmodels**: For statistical modeling and hypothesis testing.

**<span style="font-size: 1.6em">Machine Learning</span>**

- **scikit-learn**: For machine learning algorithms and tools.

**<span style="font-size: 1.6em">Additional Libraries</span>**

- **yfinance**: For fetching historical market data from Yahoo Finance.
- **requests**: For making HTTP requests to download data.
- **datetime**: For manipulating dates and times.


In [4]:
# Data Manipulation and Analysis
import pandas as pd
import numpy as np
import os

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Statistical Analysis
import scipy.stats as stats
import statsmodels.api as sm

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Additional Libraries
import yfinance as yf
import requests
from datetime import datetime


**<span style="font-size: 1.6em">Exploratory Data Analysis</span>**

**Data Preprocessing**

Data preprocessing is a crucial step in any data analysis project. It involves transforming raw data into a format that is suitable for analysis. This includes importing data, parsing and converting data types, handling missing values, and organizing the data into a structured format such as a pandas DataFrame.
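
The steps named above (importing, parsing, handling missing values, organizing) can be sketched on a tiny in-memory sample. The CSV content and the `AAA`/`BBB` symbols are made up for illustration, and forward-filling within each symbol is one possible choice for price data, not necessarily the approach used in this project.

```python
import io

import pandas as pd

# Hypothetical two-symbol sample standing in for one of the project's CSV files.
raw_csv = io.StringIO(
    "Date,Symbol,Close,Volume\n"
    "2020-01-02,AAA,10.0,100\n"
    "2020-01-03,AAA,,120\n"
    "not-a-date,AAA,10.4,130\n"
    "2020-01-02,BBB,50.0,\n"
    "2020-01-03,BBB,51.0,300\n"
)

# Import: read the raw text into a DataFrame.
df = pd.read_csv(raw_csv)

# Parse: convert Date to datetime; unparseable values become NaT instead of raising.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Handle missing values: drop rows with no usable date, then fill numeric gaps
# within each symbol by carrying the last known value forward.
df = df.dropna(subset=["Date"])
df = df.sort_values(["Symbol", "Date"]).reset_index(drop=True)
df[["Close", "Volume"]] = df.groupby("Symbol")[["Close", "Volume"]].ffill()

print(df.dtypes)
print(df)
```

A forward fill leaves a gap untouched when a symbol has no earlier value to carry forward, so the first `BBB` volume stays missing here. That is usually preferable to filling it with a mean taken across unrelated companies.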

**Importing and Parsing Data**

The first step in data preprocessing is to load each CSV file in the project directory into a pandas DataFrame and parse the columns that need type conversion, such as converting `Date` to datetime. The per-file DataFrames are then concatenated into a single DataFrame.


In [2]:
import pandas as pd
import os


directory_path = '/Users/danielberhane/Desktop/CMSC320/CMSC320 Final_Project/CMSC320-Final_Group_project'


dataframes = []

# Iterate over each file in the directory
for filename in os.listdir(directory_path):
    if filename.endswith(".csv"):
        file_path = os.path.join(directory_path, filename)
        
        # Load the CSV file
        df = pd.read_csv(file_path)
        
        # Parse necessary columns 
        if 'Date' in df.columns:
            df['Date'] = pd.to_datetime(df['Date'])
        
        # Append the DataFrame to the list
        dataframes.append(df)

# Concatenate all DataFrames into a single DataFrame
combined_data = pd.concat(dataframes, ignore_index=True)

# Display the first few rows to understand the structure of the combined data
print(combined_data.head())


  Symbol     Security             GICS Sector               GICS Sub-Industry  \
0    MMM           3M             Industrials        Industrial Conglomerates   
1    AOS  A. O. Smith             Industrials               Building Products   
2    ABT       Abbott             Health Care           Health Care Equipment   
3   ABBV       AbbVie             Health Care                   Biotechnology   
4    ACN    Accenture  Information Technology  IT Consulting & Other Services   

     Headquarters Location  Date added        CIK      Founded Exchange  \
0    Saint Paul, Minnesota  1957-03-04    66740.0         1902      NaN   
1     Milwaukee, Wisconsin  2017-07-26    91142.0         1916      NaN   
2  North Chicago, Illinois  1957-03-04     1800.0         1888      NaN   
3  North Chicago, Illinois  2012-12-31  1551152.0  2013 (1888)      NaN   
4          Dublin, Ireland  2011-07-06  1467373.0         1989      NaN   

  Shortname  ... Fulltimeemployees Longbusinesssummary Weight 
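
The preview above shows company metadata columns (`Symbol`, `Security`, `GICS Sector`) rather than price columns, which suggests that files with different schemas were concatenated into one frame. One alternative, sketched here with a hypothetical `load_csv_folder` helper and made-up file names, is to keep each file as its own DataFrame keyed by file name and combine only those that share columns.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Read every CSV in `folder` into a dict keyed by file stem (hypothetical helper)."""
    frames = {}
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path)
        if "Date" in frame.columns:
            frame["Date"] = pd.to_datetime(frame["Date"], errors="coerce")
        frames[path.stem] = frame
    return frames


# Demo with two tiny made-up files that have different schemas.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "prices.csv").write_text("Date,Symbol,Close\n2020-01-02,AAA,10.0\n")
    Path(tmp, "companies.csv").write_text("Symbol,Sector\nAAA,Industrials\n")
    frames = load_csv_folder(tmp)

for name, frame in frames.items():
    print(name, list(frame.columns))
```

Keeping the frames separate makes it easy to inspect each schema first and then join price data to company metadata on `Symbol` instead of stacking unrelated rows.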

In [None]:
**Clean the data**

In [15]:
# Standardize the date column name to 'Date'
if 'Date' not in combined_data.columns and 'date' in combined_data.columns:
    combined_data = combined_data.rename(columns={'date': 'Date'})

# Parse the date column. errors='coerce' turns unparseable entries (such as an
# 'Unknown' placeholder) into NaT instead of raising DateParseError, so this
# cell can be re-run safely.
if 'Date' in combined_data.columns:
    combined_data['Date'] = pd.to_datetime(combined_data['Date'], errors='coerce')

# Handle missing values in numeric columns: fill with the column mean
numeric_columns = combined_data.select_dtypes(include=[np.number]).columns
combined_data[numeric_columns] = combined_data[numeric_columns].fillna(combined_data[numeric_columns].mean())

# Handle missing values in non-numeric columns. Datetime columns are excluded
# so that missing dates stay NaT rather than becoming the string 'Unknown'.
non_numeric_columns = combined_data.select_dtypes(exclude=[np.number, 'datetime', 'datetimetz']).columns
combined_data[non_numeric_columns] = combined_data[non_numeric_columns].fillna('Unknown')

# Display summary statistics to verify the changes
print(combined_data.describe())

In [None]:
**Organizing Data**

  Symbol     Security             GICS Sector               GICS Sub-Industry  \
0    MMM           3M             Industrials        Industrial Conglomerates   
1    AOS  A. O. Smith             Industrials               Building Products   
2    ABT       Abbott             Health Care           Health Care Equipment   
3   ABBV       AbbVie             Health Care                   Biotechnology   
4    ACN    Accenture  Information Technology  IT Consulting & Other Services   

     Headquarters Location  Date added        CIK      Founded Exchange  \
0    Saint Paul, Minnesota  1957-03-04    66740.0         1902  Unknown   
1     Milwaukee, Wisconsin  2017-07-26    91142.0         1916  Unknown   
2  North Chicago, Illinois  1957-03-04     1800.0         1888  Unknown   
3  North Chicago, Illinois  2012-12-31  1551152.0  2013 (1888)  Unknown   
4          Dublin, Ireland  2011-07-06  1467373.0         1989  Unknown   

  Shortname  ... Fulltimeemployees Longbusinesssummary    Weig

KeyError: 'Closing_Price'
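
A `KeyError` like the one above means the requested column name does not exist in the DataFrame. A minimal sketch of a defensive lookup follows; the `find_column` helper and the candidate names are assumptions for illustration, and the actual column name should be confirmed by inspecting `combined_data.columns`.

```python
import pandas as pd


def find_column(frame, candidates):
    """Return the first column whose normalized name matches a candidate (hypothetical helper)."""
    normalized = {str(col).strip().lower().replace(" ", "_"): col for col in frame.columns}
    for name in candidates:
        if name in normalized:
            return normalized[name]
    raise KeyError(f"None of {candidates} found in columns {list(frame.columns)}")


# Made-up frame whose closing-price column is called "Close", not "Closing_Price".
prices = pd.DataFrame({"Date": ["2020-01-02"], "Close": [10.0]})
close_col = find_column(prices, ["closing_price", "close", "adj_close"])
print(close_col)
```

When no candidate matches, the error message lists the columns that do exist, which makes the mismatch easier to diagnose than a bare `KeyError`.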