
<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;">  <div style="font-size:150%; color:#FEE100"><b>Amazon Stock Data 2000-2025 Analysis</b></div>  <div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div></div>

## Table of Contents

- [Introduction and Curiosity](#Introduction-and-Curiosity)
- [Data Loading and Initial Exploration](#Data-Loading-and-Initial-Exploration)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Predictive Modeling: Predicting Closing Price](#Predictive-Modeling:-Predicting-Closing-Price)
- [Conclusion and Future Directions](#Conclusion-and-Future-Directions)

## Introduction and Curiosity

Stock data spanning 25 years can reveal fascinating trends and unexpected twists in market dynamics. In this notebook, we explore the Amazon stock data from 2000 to 2025 with an inquisitive mind and a dry sense of humor. If you find these insights useful, please consider upvoting.

In [4]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Non-interactive backend: figures are rendered off-screen (useful on headless servers)
plt.switch_backend('Agg')
sns.set(style='whitegrid')

# For reproducibility
np.random.seed(42)

## Data Loading and Initial Exploration

In [5]:
file_path = 'Amazon stock data 2000-2025.csv'
df = pd.read_csv(file_path, parse_dates=['date'], encoding='ascii', delimiter=',')
print('Data shape:', df.shape)
print(df.head())
print(df.info())

Data shape: (6321, 7)
                        date      open      high       low     close  \
0  2000-01-03 00:00:00-05:00  4.075000  4.478125  3.952344  4.468750   
1  2000-01-04 00:00:00-05:00  4.268750  4.575000  4.087500  4.096875   
2  2000-01-05 00:00:00-05:00  3.525000  3.756250  3.400000  3.487500   
3  2000-01-06 00:00:00-05:00  3.565625  3.634375  3.200000  3.278125   
4  2000-01-07 00:00:00-05:00  3.350000  3.525000  3.309375  3.478125   

   adj_close     volume  
0   4.468750  322352000  
1   4.096875  349748000  
2   3.487500  769148000  
3   3.278125  375040000  
4   3.478125  210108000  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6321 entries, 0 to 6320
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       6321 non-null   object 
 1   open       6321 non-null   float64
 2   high       6321 non-null   float64
 3   low        6321 non-null   float64
 4   close      6321 non-null   float64
 5 

## Data Cleaning and Preprocessing

In [6]:
# In many cases, date columns imported as strings can lead to issues when using the .dt accessor.
# Hence, it's important to validate that the 'date' column is of datetime type and convert it if necessary.

if not pd.api.types.is_datetime64_any_dtype(df['date']):
    df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Drop records with invalid dates, if any
df = df.dropna(subset=['date'])

# Create additional date features for granular analysis
df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.month
df['Day'] = df['date'].dt.day
df['DayOfWeek'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6

# Create a numeric representation of the date: days since the earliest date
df['Day_Number'] = (df['date'] - df['date'].min()).dt.days

print('Additional date features created.')
df.head()

Additional date features created.


| | date | open | high | low | close | adj_close | volume | Year | Month | Day | DayOfWeek | Day_Number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-03 00:00:00-05:00 | 4.075 | 4.478125 | 3.952344 | 4.46875 | 4.46875 | 322352000 | 2000 | 1 | 3 | 0 | 0 |
| 1 | 2000-01-04 00:00:00-05:00 | 4.26875 | 4.575 | 4.0875 | 4.096875 | 4.096875 | 349748000 | 2000 | 1 | 4 | 1 | 1 |
| 2 | 2000-01-05 00:00:00-05:00 | 3.525 | 3.75625 | 3.4 | 3.4875 | 3.4875 | 769148000 | 2000 | 1 | 5 | 2 | 2 |
| 3 | 2000-01-06 00:00:00-05:00 | 3.565625 | 3.634375 | 3.2 | 3.278125 | 3.278125 | 375040000 | 2000 | 1 | 6 | 3 | 3 |
| 4 | 2000-01-07 00:00:00-05:00 | 3.35 | 3.525 | 3.309375 | 3.478125 | 3.478125 | 210108000 | 2000 | 1 | 7 | 4 | 4 |
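
The `df.info()` output earlier lists `date` as `object` even though `parse_dates` was requested. One plausible cause, and this is an assumption rather than something the notebook confirms, is that the timestamps carry different UTC offsets across the year (`-05:00` in the rows shown, possibly `-04:00` in summer). The sketch below uses two made-up timestamps and an assumed `America/New_York` time zone to show one way to parse such values into a single datetime dtype and to derive a day counter that is not shifted by daylight saving time.

```python
import pandas as pd

# Two hypothetical timestamps with different UTC offsets (winter and summer).
raw_dates = pd.Series([
    "2000-01-03 00:00:00-05:00",
    "2000-07-03 00:00:00-04:00",
])

# utc=True converts every value to UTC, so the result has one tz-aware dtype.
parsed_utc = pd.to_datetime(raw_dates, utc=True, errors="coerce")

# Convert back to the exchange's local time (assumed zone), then drop the
# time zone so that each value is plain local midnight.
local = parsed_utc.dt.tz_convert("America/New_York")
wall_clock = local.dt.tz_localize(None)

# Day counter computed on wall-clock dates; differences are whole days.
day_number = (wall_clock - wall_clock.min()).dt.days

print(wall_clock.dt.year.tolist(), day_number.tolist())
```

Computing the difference on tz-aware values instead would give 181 days and 23 hours between these two dates, which `.dt.days` truncates to 181, so dropping the time zone first keeps the counter aligned with calendar days.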


## Exploratory Data Analysis

In [8]:
# Let's perform several visualizations to understand the data better

import matplotlib.dates as mdates

# 1. Pair Plot of Numeric Variables
numeric_df = df.select_dtypes(include=[np.number])

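if numeric_df.shape[1] >= 2:
    # pairplot returns a PairGrid; title the whole figure rather than the last subplot
    pair_grid = sns.pairplot(numeric_df)
    pair_grid.figure.suptitle('Pair Plot of Numeric Variables', y=1.02)
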

# 2. Correlation Heatmap (only if there are 4 or more numeric columns)
if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(10,8))
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap of Numeric Features')
    

# 3. Histogram of the 'close' prices
plt.figure(figsize=(8,6))
sns.histplot(df['close'], kde=True, bins=30, color='skyblue')
plt.title('Distribution of Closing Prices')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')


# 4. Box Plot for 'volume' to inspect outliers
plt.figure(figsize=(8,6))
sns.boxplot(x=df['volume'], color='lightgreen')
plt.title('Box Plot of Trading Volume')
plt.xlabel('Volume')


# 5. Count Plot for Day of Week to see if trading days vary
plt.figure(figsize=(8,6))
sns.countplot(x=df['DayOfWeek'], palette='pastel')
plt.title('Count of Records by Day of Week')
plt.xlabel('Day of Week (Monday=0)')
plt.ylabel('Count')


# 6. Grouped Bar Plot: Average Volume by Year
avg_volume_by_year = df.groupby('Year')['volume'].mean().reset_index()
plt.figure(figsize=(12,6))
sns.barplot(x='Year', y='volume', data=avg_volume_by_year, palette='viridis')
plt.title('Average Trading Volume by Year')
plt.xticks(rotation=45)
plt.show()
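
The box plot of volume marks points beyond the whiskers, which by default sit at 1.5 times the interquartile range (IQR) from the quartiles. To put a number on those points rather than reading them off the chart, a minimal sketch such as the following can help. The helper name `iqr_outlier_mask` and the sample values are hypothetical and only illustrate the rule.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up daily volumes (in millions of shares) with one obvious spike.
sample_volume = pd.Series([300, 322, 349, 375, 390, 410, 2500])
outlier_mask = iqr_outlier_mask(sample_volume)

print("Outliers found:", int(outlier_mask.sum()))
print(sample_volume[outlier_mask].tolist())
```

On the real data the same function could be called as `iqr_outlier_mask(df['volume'])`.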

## Predictive Modeling: Predicting Closing Price

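We now build a simple predictor for the closing price. Note that the features include the same day's open, high, and low prices, so the model estimates the close from same-day information rather than forecasting it in advance. While predicting stock prices is notoriously challenging (and sometimes frustratingly unpredictable), this model serves as a good instructional example.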

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.inspection import permutation_importance

# Define the feature set and target variable
feature_cols = ['open', 'high', 'low', 'volume', 'Year', 'Month', 'Day', 'DayOfWeek', 'Day_Number']
target = 'close'

# Prepare X and y
X = df[feature_cols]
y = df[target]

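# Split the data into training and testing sets.
# Note: this is a random shuffle, so it ignores time order and test rows are interleaved with training rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)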

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute accuracy metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f'Linear Regression R2 Score: {r2:.4f}')
print(f'Mean Absolute Error: {mae:.4f}')

# Permutation Importance to see feature relevance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
importance = result.importances_mean

plt.figure(figsize=(10,6))
plt.barh(feature_cols, importance, color='slateblue')
plt.title('Permutation Importance of Features')
plt.xlabel('Mean Importance')
plt.ylabel('Features')
plt.show()

# Plotting Actual vs Predicted values
plt.figure(figsize=(8,6))
plt.scatter(y_test, y_pred, alpha=0.7, color='teal')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Actual vs Predicted Closing Prices')
plt.xlabel('Actual Closing Price')
plt.ylabel('Predicted Closing Price')
plt.show()

Linear Regression R2 Score: 0.9999
Mean Absolute Error: 0.2139
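
An R2 this close to 1 is plausibly explained by two choices in the cell above: the features contain the same day's open, high, and low, which bracket the close, and the random split mixes past and future rows. A stricter check, sketched below under stated assumptions, uses only information from earlier days and a chronological split, and compares the model against a naive "yesterday's close" baseline. The data is a synthetic random walk, not the Amazon dataset, and the names `demo`, `prev_close`, and `prev_return` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic random-walk closing prices (illustration only).
rng = np.random.default_rng(42)
n_days = 500
close = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.02, n_days)))
demo = pd.DataFrame({"close": close})

# Lagged features: only values known before the day being predicted.
demo["prev_close"] = demo["close"].shift(1)
demo["prev_return"] = demo["close"].pct_change().shift(1)
demo = demo.dropna().reset_index(drop=True)

# Chronological split: first 80% of rows for training, last 20% for testing.
split = int(len(demo) * 0.8)
train, test = demo.iloc[:split], demo.iloc[split:]

features = ["prev_close", "prev_return"]
lag_model = LinearRegression().fit(train[features], train["close"])

mae_model = mean_absolute_error(test["close"], lag_model.predict(test[features]))
mae_naive = mean_absolute_error(test["close"], test["prev_close"])

print(f"Lag-feature model MAE: {mae_model:.4f}")
print(f"Naive previous-close MAE: {mae_naive:.4f}")
```

If a model cannot beat the naive baseline under this setup, a high score from the random split is likely measuring how tightly open, high, and low surround the close rather than any forecasting ability.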


## Conclusion and Future Directions

In this notebook, we walked through a complete data science workflow: loading the data, cleaning it, exploring it, and building a simple regression model for predicting closing prices. We used a variety of visualizations, from pair plots and heatmaps to bar charts, to uncover insights hidden in the long-term evolution of Amazon stock data.

The predictor, while simplistic, provides a benchmark for more sophisticated models. Future work could incorporate time-series-specific approaches (like ARIMA or LSTM networks) and include external economic or sentiment data for improved accuracy.

If this notebook helped shed some light on the stock data trends, please consider upvoting. After all, every upvote is a small vote for the sanity of data scientists everywhere.