# US Stock Market Prediction (2014-2018)
**Authors:** David PAGNIEZ, Antoine KRYCHOWSKI, Gregoire MEHAH

## 1. Project Objective
The goal of this project is to analyze financial indicators of US companies from 2014 to 2018 and build a machine learning model to predict stock performance.

Our approach follows these steps:
1.  **Data Cleaning:** Handling missing values and outliers in financial data.
2.  **Baseline Model:** Attempting to predict the exact price variation (Regression).
3.  **Classification:** Switching to a trend prediction (Buy/Sell) if regression fails.
4.  **Optimization:** Using Feature Engineering and Ensemble methods to improve results.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PowerTransformer

# Visual configuration
plt.style.use("ggplot")
%matplotlib inline

# 1. Loading Data
# We iterate through the CSV files to merge them into a single dataframe.
years = [2014, 2015, 2016, 2017, 2018]
dfs = []

print("Loading data...")
for year in years:
    try:
        filename = f'{year}_Financial_Data.csv'
        df = pd.read_csv(filename)
        df['Year'] = year
        dfs.append(df)
        print(f"Loaded {filename}")
    except FileNotFoundError:
        print(f"Error: {filename} not found.")

data = pd.concat(dfs, ignore_index=True)

# 2. Target Consolidation
# The target column name is not consistent across files, so we merge them.
target_cols = [col for col in data.columns if 'PRICE VAR' in col]
data['Target_Price_Var'] = data[target_cols].bfill(axis=1).iloc[:, 0]
data['Target_Price_Var'] = pd.to_numeric(data['Target_Price_Var'], errors='coerce')

# Dropping useless columns
cols_to_drop = target_cols + ['Class', 'Unnamed: 0', 'Symbol']
data = data.drop(columns=[c for c in cols_to_drop if c in data.columns])

print(f"Initial Dataset Shape: {data.shape}")


Loading data...
Loaded 2014_Financial_Data.csv
Loaded 2015_Financial_Data.csv
Loaded 2016_Financial_Data.csv
Loaded 2017_Financial_Data.csv
Loaded 2018_Financial_Data.csv
Initial Dataset Shape: (22077, 225)
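
The target consolidation step in the cell above relies on a row-wise back-fill. The sketch below reproduces the idea on a toy frame; the column names and values are made up for illustration and are only assumed to resemble the real files.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): each yearly file names its target column
# differently, so after concatenation each row has one non-missing target.
toy = pd.DataFrame({
    "2015 PRICE VAR [%]": [12.5, np.nan, np.nan],
    "2016 PRICE VAR [%]": [np.nan, -3.0, np.nan],
    "2017 PRICE VAR [%]": [np.nan, np.nan, 40.0],
})

target_cols = [c for c in toy.columns if "PRICE VAR" in c]

# Back-fill along each row: the first column then holds the first
# non-missing value found in that row, which we keep as the single target.
toy["Target_Price_Var"] = toy[target_cols].bfill(axis=1).iloc[:, 0]

print(toy["Target_Price_Var"])
```

If a row had no target in any column, the result would stay `NaN`, which is why the rows with a missing target are dropped in the next step.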




## 2. Preprocessing and Outlier Management

Financial data is noisy. Company sizes span several orders of magnitude (Apple or Google versus small caps), and some data points look like errors (e.g., a 10,000% price increase).

Left untouched, these outliers would dominate the fit and bias the model.
Strategy: We apply Winsorization (Clipping). Each feature is capped at its 1st and 99th percentiles, and the target is capped at fixed bounds (-99% and +500%). This limits extreme values without deleting any rows.
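
As a minimal sketch of percentile clipping on a single column, assuming synthetic data and a hypothetical helper named `winsorize` (not part of the notebook):

```python
import numpy as np
import pandas as pd


def winsorize(series: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    """Cap values outside the given quantiles instead of dropping rows."""
    lower = series.quantile(lower_q)
    upper = series.quantile(upper_q)
    return series.clip(lower=lower, upper=upper)


# Synthetic data: 1,000 ordinary values plus one absurd entry.
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(size=1000))
values.iloc[0] = 10_000.0

clipped = winsorize(values)

print(f"max before: {values.max():.1f}, max after: {clipped.max():.3f}")
print(f"rows before: {len(values)}, rows after: {len(clipped)}")
```

The row count is unchanged: the extreme entry is pulled back to the 99th percentile rather than removed.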


In [3]:
# 1. Remove rows where the target is missing
data_clean = data.dropna(subset=['Target_Price_Var']).copy()

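# 2. Remove columns with more than 40% missing values
# `thresh` is the minimum number of non-missing values a column needs to be kept (an integer)
threshold = 0.4
data_clean = data_clean.dropna(thresh=int(np.ceil(len(data_clean) * (1 - threshold))), axis=1)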

# 3. Winsorization (Clipping)
# We cap the target at fixed bounds: -99% (near-total loss) and +500% (price multiplied by 6)
data_clean['Target_Price_Var'] = data_clean['Target_Price_Var'].clip(lower=-99, upper=500)

# We limit the Features between the 1st and 99th quantile
numeric_cols = data_clean.select_dtypes(include=np.number).columns.drop(['Year', 'Target_Price_Var'])
lower_bounds = data_clean[numeric_cols].quantile(0.01)
upper_bounds = data_clean[numeric_cols].quantile(0.99)

data_clean[numeric_cols] = data_clean[numeric_cols].clip(lower=lower_bounds, upper=upper_bounds, axis=1)

print("Preprocessing done. Outliers have been clipped.")


Preprocessing done. Outliers have been clipped.
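
The column filter in step 2 of the cell above depends on how `dropna` interprets `thresh`: it is the minimum number of non-missing values a column must have to survive, not a fraction. A small sketch on made-up data, with illustrative column names:

```python
import numpy as np
import pandas as pd

# Made-up frame with two columns of different sparsity.
toy = pd.DataFrame({
    "mostly_filled": [1.0, 2.0, 3.0, 4.0, np.nan],        # 20% missing
    "half_empty":    [1.0, np.nan, np.nan, 4.0, np.nan],  # 60% missing
})

max_missing = 0.4  # tolerate at most 40% missing values per column

# Convert the tolerated fraction into a required count of non-missing values.
min_non_missing = int(np.ceil(len(toy) * (1 - max_missing)))

kept = toy.dropna(axis=1, thresh=min_non_missing)
print(list(kept.columns))
```

Only the column that meets the required count of non-missing values is kept.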


## 3. First Approach: Linear Regression

Our first approach was to predict the exact percentage price variation.
We used a standard Linear Regression model.

*   Train set: 2014, 2015, 2016
*   Test set: 2018 (held out to simulate predicting a future year)

Note that 2017 is used in neither set in this first experiment.


In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Temporal Split
train = data_clean[data_clean['Year'].isin([2014, 2015, 2016])]
test = data_clean[data_clean['Year'] == 2018]

# --- FIX: keep numeric columns only ---
# .select_dtypes(include=np.number) automatically drops the text columns (Sector, etc.)
X_train = train.drop(columns=['Target_Price_Var', 'Year']).select_dtypes(include=np.number)
y_train = train['Target_Price_Var']

X_test = test.drop(columns=['Target_Price_Var', 'Year']).select_dtypes(include=np.number)
y_test = test['Target_Price_Var']

# Pipeline: Imputation -> Scaling -> Regression
reg_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Works now that every column is numeric
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

reg_pipeline.fit(X_train, y_train)
preds_reg = reg_pipeline.predict(X_test)

print("Linear Regression Results:")
print(f"R2 Score: {r2_score(y_test, preds_reg):.4f}")
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_test, preds_reg):.2f}%")


Linear Regression Results:
R2 Score: -0.1006
Mean Absolute Error (MAE): 38.91%
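
To help read the score above: R2 equals 0 for a model that always predicts the mean of the test target, so a negative value means the predictions are worse than that constant baseline. The sketch below illustrates this with made-up numbers and a hand-written `r2` helper that follows the usual definition, 1 - SS_res / SS_tot; it is not the notebook's data.

```python
import numpy as np


def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean baseline
    return float(1.0 - ss_res / ss_tot)


# Made-up yearly returns (in %) and two sets of predictions.
y_true = np.array([-20.0, 5.0, 10.0, 45.0])
mean_baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_model = np.array([30.0, -10.0, 40.0, 0.0])       # predictions far from the truth

r2_baseline = r2(y_true, mean_baseline)
r2_bad = r2(y_true, bad_model)

print(f"R2 of mean baseline: {r2_baseline:.3f}")
print(f"R2 of bad model:     {r2_bad:.3f}")
```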
