<a href="https://colab.research.google.com/github/LPS25/ML-Project/blob/main/Laxmi_Priya_Swain_%7C_Stock_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Machine Learning Project On Stock Price Prediction



# **Project Summary -**

Machine learning helps many industries automate tasks that once required human labor. One such application is predicting whether a particular trade will be profitable.

In this project, we use ML to predict a signal that indicates whether buying a particular stock is likely to be profitable.

We start by importing the libraries used throughout the project; their roles are explained as they come up.

An end-to-end data science project follows the steps below:

Step 1: **Define the Problem Statement**

* Understand the industry and categorize the problem type (Supervised, Unsupervised, Semi, etc.).
* Comprehend the business objective and desired outcomes.
* Identify constraints, limitations, computational power, budget, and data availability.
* Determine evaluation metrics for optimization, tracking KPIs, and required testing.
* Assess the model's relevancy to the target audience, focusing on prediction speed.
* Evaluate data availability and necessary features for collection.
* Define the scope of the solution to manage expectations.
* Consider deployment options such as cloud platforms, web apps, websites, or APIs.

Step 2: **Data Collection**

* Identify reliable sources such as databases, APIs, sensors, or surveys.
* Specify the required data volume for effective analysis.
* Classify data as labeled or unlabeled based on availability.
* Address data quality issues, errors, bias, and consistency.
* Ensure data relevancy to the problem being addressed.
* Account for temporal effects and changes in the data.
* Handle legal and ethical concerns related to data privacy.
* Implement sampling strategies and data privacy techniques.
* Utilize appropriate tools for data collection.
* Implement version control to manage dataset changes.
* Consider continuous data collection for improved accuracy.

Step 3: **Data Preprocessing**

* Handle missing values using various imputation techniques.
* Address outliers using standard deviation or IQR methods.
* Encode categorical variables using suitable techniques.
* Transform data through standardization, normalization, or other methods.
* Handle imbalanced datasets using techniques like oversampling or undersampling.
* Reduce dimensionality for better computational efficiency.
* Apply techniques to transform data for optimal model performance.

Step 4: **Exploratory Data Analysis (EDA)**

* Analyze data distribution using summary statistics and visualizations.
* Explore relationships between variables through scatter plots and bar charts.
* Study complex interrelationships using heatmaps and pair plots.
* Identify temporal patterns and trends.
* Visualize categorical data using appropriate charts.
* Use PCA for dimensionality reduction and visualization.
* Perform statistical and hypothesis tests to validate assumptions.
* Visualize complex data types such as text or images.

Step 5: **Model Selection, Training & Evaluation**

* Split data into training and testing sets.
* Choose suitable algorithms from a library based on the problem.
* Select evaluation metrics aligned with the problem domain.
* Ensure scalability and efficient processing for larger datasets.
* Optimize hyperparameters through techniques like grid search.
* Utilize parallel processing and GPU resources for training.
* Interpret and explain model decisions using tools like SHAP or LIME.
* Address imbalanced data to prevent bias in model performance.
* Consider pre-trained models and transfer learning for enhanced training.
* Implement early stopping to prevent overfitting.
* Save and load models for future use.
* Use experiment logging and versioning tools.
* Integrate the model into data processing pipelines.
* Implement feedback loops for retraining based on updated data.
* Explore automated machine learning (AutoML) for model selection and hyperparameter tuning.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Problem Statement:
Imagine yourself as a freelance data scientist ready for the next project adventure. Your task is to select a machine learning project from the list provided or propose an original project idea that resonates with you. Your objective is to identify a specific challenge within the chosen industry domain and design a machine learning solution to address it. Whether you're predicting customer behavior, optimizing processes, or making healthcare more efficient, your project should demonstrate your ability to approach complex problems, preprocess and analyze relevant data, develop and fine-tune models, and interpret results in a meaningful way. Your project will be a testament to your adaptability, curiosity, and aptitude for machine learning.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/TSLA.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

From the first five rows, we can see that data for some dates is missing. The reason is that the stock market is closed on weekends and holidays, so no trading happens on those days.
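
One way to check this explanation is to look at the weekday of each date: if only closed-market days are missing, no row should fall on a Saturday or Sunday. The sketch below uses a small made-up `sample` frame rather than the real `df`; the dates and the `YYYY-MM-DD` format are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: ten consecutive business days (not the real TSLA data).
sample = pd.DataFrame(
    {"Date": pd.bdate_range("2010-06-29", periods=10).strftime("%Y-%m-%d")}
)

dates = pd.to_datetime(sample["Date"])

# dayofweek: Monday=0 ... Sunday=6, so values >= 5 are weekend days.
weekend_rows = int((dates.dt.dayofweek >= 5).sum())

# Calendar-day gap between consecutive rows; gaps > 1 come from weekends or holidays.
gaps = dates.diff().dt.days.dropna()

print("weekend rows:", weekend_rows)
print("largest gap in days:", int(gaps.max()))
```

Running the same two checks on the real `df['Date']` column would show whether every gap in the data is explained by a non-trading day.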

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


This shows that the dataset has 2416 rows and that each row has 7 features (columns).

In [None]:
df.describe()


### Dataset Information

In [None]:
# Dataset Info
df.info()


# **Exploratory Data Analysis**

EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.

While performing the EDA of the Tesla stock price data, we will analyze how the price of the stock has moved over time and how the end of each quarter affects it.

In [None]:
plt.figure(figsize=(15,5))
plt.plot(df['Close'])
plt.title('Tesla Close price.', fontsize=15)
plt.ylabel('Price in dollars.')
plt.show()


The plot of the closing price shows that Tesla's stock price follows an upward trend over the period.

In [None]:
df.head()


If we look carefully, the data in the ‘Close’ column appears to be the same as in the ‘Adj Close’ column. Let’s check whether this holds for every row.

In [None]:
df[df['Close'] == df['Adj Close']].shape


If the number of matching rows equals the total row count, every row of ‘Close’ and ‘Adj Close’ holds the same value. Redundant data does not help the analysis, so we’ll drop the ‘Adj Close’ column before going further.
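
The `==` comparison above requires the two columns to match bit for bit. A more tolerant check can be sketched with `np.isclose`, which treats values that differ only by floating-point noise as equal. The prices below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical prices; the last 'Adj Close' differs only by floating-point noise.
sample = pd.DataFrame({
    "Close": [23.89, 23.83, 21.96],
    "Adj Close": [23.89, 23.83, 21.96 + 1e-12],
})

# Exact comparison: fails as soon as a single value differs in the last bits.
exact_match = bool((sample["Close"] == sample["Adj Close"]).all())

# Tolerant comparison: uses np.isclose's default relative and absolute tolerances.
close_match = bool(np.isclose(sample["Close"], sample["Adj Close"]).all())

print(exact_match, close_match)
```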

In [None]:
df = df.drop(['Adj Close'], axis=1)


Before drawing the distribution plots for the continuous features in the dataset, let’s check whether the data frame contains any null values.

In [None]:
df.isnull().sum()


This implies that there are no null values in the data set provided.

In [None]:
features = ['Open', 'High', 'Low', 'Close', 'Volume']

plt.figure(figsize=(20, 10))

for i, col in enumerate(features):
    plt.subplot(2, 3, i + 1)
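    # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
    sb.histplot(df[col], kde=True)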

plt.show()


In the distribution plots of the OHLC data, we can see two peaks, which means the prices have clustered around two distinct regions. The Volume data is right-skewed: most days have moderate volume, with a long tail of high-volume days.
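
The direction of skew can also be checked numerically rather than read off the plot: `Series.skew()` is positive when the long tail points to the right and negative when it points to the left. The sketch below uses synthetic log-normal values as a stand-in for trading volume; the data and its parameters are assumptions, not the TSLA figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for trading volume: many moderate values, a few very large ones.
volume = pd.Series(rng.lognormal(mean=15, sigma=0.6, size=2000))

# Sample skewness: > 0 means a long right tail, < 0 a long left tail.
skewness = volume.skew()
print(f"skewness: {skewness:.2f}")
```

Calling `df['Volume'].skew()` on the real data gives the corresponding number for this dataset.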

In [None]:
features = ['Open', 'High', 'Low', 'Close', 'Volume']

# Create a 2x3 grid of subplots with a specified size
fig, axes = plt.subplots(2, 3, figsize=(20, 10))

# Flatten the axes array for easier indexing
axes = axes.flatten()

for i, col in enumerate(features):
    # Select the subplot for the current feature
    ax = axes[i]

    # Create a boxplot for the current feature
    sb.boxplot(data=df, x=col, ax=ax)

plt.show()

From the above boxplots, we can conclude that only the Volume data contains outliers; the data in the other columns is free of them.
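
Boxplots conventionally flag a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The same rule can be sketched as a count; the helper name `count_iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Hypothetical values: one obvious spike among otherwise similar numbers.
sample = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
print(count_iqr_outliers(sample))
```

Applied column by column, for example `count_iqr_outliers(df['Volume'])`, this gives a number to go with each boxplot.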

# **Feature Engineering**

Feature Engineering helps to derive some valuable features from the existing ones. These extra features sometimes help in increasing the performance of the model significantly and certainly help to gain deeper insights into the data.

In [None]:
splitted = df['Date'].str.split('/', expand=True)

# Print the structure of the 'splitted' DataFrame
print(splitted)
df.head()

In [None]:
# Extract the month (1-12) so that quarter-end months can be flagged below
df['month'] = pd.to_datetime(df['Date']).dt.month
df['is_quarter_end'] = np.where(df['month']%3==0,1,0)
df.head()


A quarter is a group of three months. Every company prepares its quarterly results and publishes them so that people can analyze the company’s performance. These quarterly results can affect stock prices heavily, which is why we have added this feature: it may be useful to the learning model.
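
To make the definition concrete: a month closes a calendar quarter when its number is divisible by 3 (March, June, September, December). The sketch below derives the flag for a few made-up dates; the `sample` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dates, one per month for the first half of a year.
sample = pd.DataFrame({"Date": ["2015-01-15", "2015-02-16", "2015-03-16",
                                "2015-04-15", "2015-05-15", "2015-06-15"]})

# Month number 1-12; months 3, 6, 9 and 12 end a calendar quarter.
month = pd.to_datetime(sample["Date"]).dt.month
sample["is_quarter_end"] = np.where(month % 3 == 0, 1, 0)

print(sample["is_quarter_end"].tolist())
```

Only the March and June rows are flagged, which is the behaviour the `is_quarter_end` feature is meant to have.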

In [None]:
# Extract the year from the 'Date' column and create a new 'year' column
df['year'] = pd.to_datetime(df['Date']).dt.year

# Group the data by year and calculate the mean for each year
# (numeric_only=True skips the string 'Date' column, which cannot be averaged)
data_grouped = df.groupby('year').mean(numeric_only=True)

# Create subplots for 'Open', 'High', 'Low', and 'Close' columns
plt.figure(figsize=(20, 10))

for i, col in enumerate(['Open', 'High', 'Low', 'Close']):
    plt.subplot(2, 2, i + 1)
    data_grouped[col].plot.bar()
    plt.title(col)

plt.show()


From the above bar graphs, we can conclude that the stock price doubled from 2013 to 2014.

In [None]:
# numeric_only=True skips the string 'Date' column
df.groupby('is_quarter_end').mean(numeric_only=True)


Here are some of the important observations of the above-grouped data:

* Prices are higher in quarter-end months than in non-quarter-end months.

* The volume of trades is lower in quarter-end months.

In [None]:
df['open-close'] = df['Open'] - df['Close']
df['low-high'] = df['Low'] - df['High']
df['target'] = np.where(df['Close'].shift(-1) > df['Close'], 1, 0)


Above we have added some more columns that will help in training our model. The target column is the signal for whether to buy or not: it is 1 when the next day's closing price is higher than the current day's, and 0 otherwise. This is what we will train our model to predict. Before proceeding, let’s check whether the target is balanced using a pie chart.
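
How `shift(-1)` produces the target is easiest to see on a handful of rows. The closing prices below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for five consecutive trading days.
sample = pd.DataFrame({"Close": [10.0, 11.0, 10.5, 10.5, 12.0]})

# shift(-1) aligns each row with the next day's close; the last row gets NaN.
next_close = sample["Close"].shift(-1)

# 1 if tomorrow closes higher than today, else 0 (a comparison with NaN is False).
sample["target"] = np.where(next_close > sample["Close"], 1, 0)

print(sample["target"].tolist())
```

Note that the last row has no next day, so it receives 0 by default rather than a real label; dropping that row before training is one way to avoid a mislabeled sample.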

In [None]:
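plt.figure(figsize=(15, 6))
# value_counts() orders by frequency, so sort by class and take the labels
# from the index to keep each slice matched to its class
counts = df['target'].value_counts().sort_index()
plt.pie(counts.values, labels=counts.index, autopct='%1.1f%%')
plt.show()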


When we add features to our dataset we have to ensure that there are no highly correlated features as they do not help in the learning process of the algorithm.

In [None]:
plt.figure(figsize=(10, 10))

# As our concern is with the highly
# correlated features only so, we will visualize
# our heatmap as per that criteria only.
# numeric_only=True leaves out the string 'Date' column
sb.heatmap(df.corr(numeric_only=True) > 0.9, annot=True, cbar=False, cmap='magma')
plt.show()


From the above heatmap, we can say that there is a high correlation between the OHLC columns, which is to be expected. The added features are not highly correlated with each other or with the original features, so we can go ahead and build our model.
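
The heatmap shows the thresholded matrix visually; the same information can be listed as explicit column pairs. The sketch below is one possible way to do that. The helper name `high_corr_pairs` and the synthetic columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, corr) for numeric column pairs with correlation above threshold."""
    corr = frame.select_dtypes("number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skipping the diagonal
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[i], cols[j], round(value, 3)))
    return pairs

# Synthetic data: 'Close' tracks 'Open' closely, 'noise' is unrelated to both.
rng = np.random.default_rng(1)
base = rng.normal(size=200).cumsum() + 100
sample = pd.DataFrame({
    "Open": base,
    "Close": base + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

pairs = high_corr_pairs(sample)
print(pairs)
```

Like the heatmap, this only flags strong positive correlations; comparing `abs(value)` against the threshold would also catch strong negative ones.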

# **Data Splitting and Normalization**

In [None]:
features = df[['open-close', 'low-high', 'is_quarter_end']]
target = df['target']

scaler = StandardScaler()
features = scaler.fit_transform(features)

X_train, X_valid, Y_train, Y_valid = train_test_split(
    features, target, test_size=0.1, random_state=2022)
print(X_train.shape, X_valid.shape)


After selecting the features to train the model on, we normalize the data, because normalized data leads to stable and fast training. The whole dataset is then split into two parts with a 90/10 ratio so that we can evaluate the performance of our model on unseen data.
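
An alternative worth considering for time-ordered data, and not what the cell above does, is to split chronologically and compute the scaling statistics from the training rows only, so that nothing from the validation rows influences preprocessing. Below is a minimal NumPy sketch with made-up features; the helper name `chronological_split_and_scale` is hypothetical.

```python
import numpy as np

def chronological_split_and_scale(X, y, valid_fraction=0.1):
    """Keep the last rows for validation and standardize with training statistics only."""
    n_valid = max(1, int(len(X) * valid_fraction))
    X_train, X_valid = X[:-n_valid], X[-n_valid:]
    y_train, y_valid = y[:-n_valid], y[-n_valid:]

    # Mean and standard deviation are estimated on the training rows only.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns

    return (X_train - mean) / std, (X_valid - mean) / std, y_train, y_valid

rng = np.random.default_rng(2022)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # hypothetical features, in time order
y = rng.integers(0, 2, size=100)                   # hypothetical binary target

X_train, X_valid, y_train, y_valid = chronological_split_and_scale(X, y)
print(X_train.shape, X_valid.shape)
```

With this layout the validation rows always come after the training rows in time, which is closer to how a trading signal would be used in practice.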

# **Model Development and Evaluation**

Now it is time to train some widely used machine learning models (Logistic Regression, Support Vector Machine, XGBClassifier). Based on their performance on the training and validation data, we will choose the model that serves the purpose at hand best.

For the evaluation metric, we will use ROC-AUC. The reason is that instead of hard predictions (0 or 1) we want the model to output soft probabilities, that is, continuous values between 0 and 1. With soft probabilities, ROC-AUC is commonly used to measure how well the predictions separate the two classes.
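
To illustrate why soft probabilities matter here: ROC-AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it directly from that definition in NumPy. The function name `roc_auc_from_scores` and the example scores are hypothetical.

```python
import numpy as np

def roc_auc_from_scores(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score against every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = [0, 0, 1, 1]
soft = [0.1, 0.4, 0.45, 0.8]  # hypothetical predicted probabilities
hard = [0, 0, 0, 1]           # the same predictions thresholded at 0.5

auc_soft = roc_auc_from_scores(y_true, soft)
auc_hard = roc_auc_from_scores(y_true, hard)
print(auc_soft, auc_hard)
```

The soft scores rank every positive above every negative, while thresholding collapses three of the four predictions into a tie and loses part of that ranking information.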

In [None]:
# Logistic Regression, polynomial-kernel SVM, and XGBoost classifiers
models = [LogisticRegression(), SVC(kernel='poly', probability=True), XGBClassifier()]

for model in models:
    model.fit(X_train, Y_train)

    # The reported metric is ROC-AUC, computed from the predicted probability of class 1
    print(f'{model} : ')
    print('Training ROC-AUC : ', metrics.roc_auc_score(Y_train, model.predict_proba(X_train)[:, 1]))
    print('Validation ROC-AUC : ', metrics.roc_auc_score(Y_valid, model.predict_proba(X_valid)[:, 1]))
    print()


Among the three models we have trained, XGBClassifier has the highest training score, but it is prone to overfitting, as the gap between its training and validation ROC-AUC is large. Logistic Regression does not show such a gap.

Now let’s plot a confusion matrix for the validation data.

In [None]:
from sklearn.metrics import confusion_matrix

# Use the first trained model in the list (Logistic Regression)
model = models[0]

# Make predictions on the validation data
y_pred = model.predict(X_valid)

# Calculate the confusion matrix
cm = confusion_matrix(Y_valid, y_pred)

# Create a heatmap of the confusion matrix
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()

classes = [0, 1]
tick_marks = range(len(classes))
plt.xticks(tick_marks, classes)
plt.yticks(tick_marks, classes)

plt.xlabel('Predicted')
plt.ylabel('True')

thresh = cm.max() / 2.
for i in range(len(classes)):
    for j in range(len(classes)):
        plt.text(j, i, cm[i, j], horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

plt.show()
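
The counts in a 2x2 confusion matrix can be turned into summary metrics. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and uses made-up counts, not the results of this notebook.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[50, 60],
               [45, 87]])

# For a binary problem, flattening row by row yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the days predicted "up", the share that really went up
recall = tp / (tp + fn)     # of the days that really went up, the share that was predicted

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Reading precision and recall next to accuracy helps show whether a model is merely predicting the majority class most of the time.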


# **Conclusion**

The stock market plays a significant role in the economy and is often linked to a country's GDP growth. We can observe that the validation performance achieved by the models is no better than simply guessing with a probability of 50%. Possible reasons are the limited amount of data and the use of very simple models and features for a task as complex as stock market prediction.

* The project demonstrates the use of machine learning to predict stock price movements.
* Further model tuning and additional evaluation metrics may be needed to improve and properly assess model performance.