# Stock Market Trend Analysis - Phase 1
## Notebook 1: Data Loading and Initial Verification

**Name:** Enerita Chatora  
**Date:** December 2025  
**Project:** Stock Market Trend Analysis - Global Investment Partners

---
## Setup: Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("STOCK MARKET TREND ANALYSIS - DATA LOADING")
print("\nLibraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

STOCK MARKET TREND ANALYSIS - DATA LOADING

Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.3.5


---
## Part 1: Load All Datasets

In [2]:

# Define file paths - CHANGE THIS to match your folder structure
data_path = "../data/raw/"

datasets = {
    'company_info': 'company_info.csv',
    'stock_prices': 'stock_prices.csv',
    'stock_prices_with_indicators': 'stock_prices_with_indicators.csv',
    'market_indices': 'market_indices.csv'
}

# Dictionary to store loaded dataframes
dfs = {}

# Load each dataset
for name, filename in datasets.items():
    try:
        filepath = data_path + filename
        df = pd.read_csv(filepath)
        dfs[name] = df
        print(f"✓ Loaded {name}: {df.shape[0]:,} rows × {df.shape[1]} columns")
    except FileNotFoundError:
        print(f"✗ ERROR: {filename} not found at {filepath}")
    except Exception as e:
        print(f"✗ ERROR loading {filename}: {str(e)}")

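# Report success only if every file was actually loaded
if len(dfs) == len(datasets):
    print("\nAll datasets loaded successfully!")
else:
    failed = sorted(set(datasets) - set(dfs))
    print(f"\n⚠ Failed to load: {failed}")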

✓ Loaded company_info: 20 rows × 4 columns
✓ Loaded stock_prices: 15,600 rows × 8 columns
✓ Loaded stock_prices_with_indicators: 15,502 rows × 31 columns
✓ Loaded market_indices: 780 rows × 7 columns

All datasets loaded successfully!
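
The loop above reports a missing file and carries on, so a later cell would fail with a `KeyError` on `dfs[...]`. A minimal sketch of a stricter loader is shown below. `load_datasets` is a hypothetical helper, not part of this notebook, and the demo writes a throwaway CSV into a temporary directory so that it runs without the project files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_datasets(data_dir, datasets):
    """Load every CSV named in `datasets`; raise if any file is absent."""
    data_dir = Path(data_dir)
    missing = [fn for fn in datasets.values() if not (data_dir / fn).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {data_dir}: {missing}")
    return {name: pd.read_csv(data_dir / fn) for name, fn in datasets.items()}


# Demo on a temporary directory (illustrative file contents).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "company_info.csv").write_text("ticker,sector\nSTK001,Technology\n")
    demo = load_datasets(tmp, {"company_info": "company_info.csv"})
    print(demo["company_info"].shape)

    try:
        load_datasets(tmp, {"stock_prices": "stock_prices.csv"})
    except FileNotFoundError as err:
        error_message = str(err)
        print(error_message)
```

Failing early keeps the error next to its cause instead of surfacing several cells later.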


---
## Part 2: Dataset 1 - Company Information

In [3]:


company_info = dfs['company_info']

print("\n1. Basic Information:")
print(f"   - Shape: {company_info.shape}")
print(f"   - Total companies: {company_info['ticker'].nunique()}")


1. Basic Information:
   - Shape: (20, 4)
   - Total companies: 20


In [4]:
print("\n2. Column Information:")
company_info.info()


2. Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   ticker        20 non-null     object
 1   company_name  20 non-null     object
 2   sector        20 non-null     object
 3   ipo_date      20 non-null     object
dtypes: object(4)
memory usage: 772.0+ bytes


In [5]:
print("\n3. First 5 Records:")
company_info.head()


3. First 5 Records:


   ticker  company_name      sector    ipo_date
0  STK001      TechCorp  Technology  2021-04-10
1  STK002   DataSystems  Technology  2016-10-12
2  STK003     CloudNine  Technology  2016-01-18
3  STK004   CyberShield  Technology  2022-02-21
4  STK005     MediPharm  Healthcare  2018-03-05


In [6]:
print("\n4. Sector Distribution:")
company_info['sector'].value_counts()


4. Sector Distribution:


sector
Technology    4
Healthcare    4
Finance       4
Consumer      4
Energy        4
Name: count, dtype: int64

In [7]:
print("\n5. Data Quality Check:")
print(f"   - Missing values: {company_info.isnull().sum().sum()}")
print(f"   - Duplicate records: {company_info.duplicated().sum()}")


5. Data Quality Check:
   - Missing values: 0
   - Duplicate records: 0


---
## Part 3: Dataset 2 - Stock Prices (OHLCV Data)

In [8]:

stock_prices = dfs['stock_prices']

print("\n1. Basic Information:")
print(f"   - Shape: {stock_prices.shape}")
print(f"   - Number of stocks: {stock_prices['ticker'].nunique()}")
print(f"   - Date range: {stock_prices['date'].min()} to {stock_prices['date'].max()}")


1. Basic Information:
   - Shape: (15600, 8)
   - Number of stocks: 20
   - Date range: 2021-01-04 to 2023-12-29


In [9]:
print("\n2. Column Information:")
stock_prices.info()


2. Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15600 entries, 0 to 15599
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   ticker          15600 non-null  object 
 1   date            15600 non-null  object 
 2   open            15600 non-null  float64
 3   high            15600 non-null  float64
 4   low             15600 non-null  float64
 5   close           15600 non-null  float64
 6   volume          15600 non-null  int64  
 7   adjusted_close  15600 non-null  float64
dtypes: float64(5), int64(1), object(2)
memory usage: 975.1+ KB
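
The `date` column is stored as `object` (plain strings). Calling `min()` and `max()` on it still gives the right range because ISO `YYYY-MM-DD` strings sort chronologically, but date arithmetic and resampling need a real datetime dtype. A small sketch of the conversion, using two rows from the preview entered in reverse order on purpose:

```python
import pandas as pd

prices = pd.DataFrame({
    "ticker": ["STK001", "STK001"],
    "date": ["2021-01-05", "2021-01-04"],
    "close": [162.36, 160.11],
})

# An explicit format makes parsing strict: a malformed date raises
# instead of being silently guessed.
prices["date"] = pd.to_datetime(prices["date"], format="%Y-%m-%d")

# Sort by ticker and date so that rolling or shift operations see rows in time order.
prices = prices.sort_values(["ticker", "date"]).reset_index(drop=True)

print(prices.dtypes)
print(prices)
```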


In [10]:
print("\n3. First 5 Records:")
stock_prices.head()


3. First 5 Records:


   ticker        date    open    high     low   close   volume  adjusted_close
0  STK001  2021-01-04  158.09  160.97  158.09  160.11   962644          160.11
1  STK001  2021-01-05  163.16  165.50  160.76  162.36  1312685          162.36
2  STK001  2021-01-06  161.89  162.51  160.94  161.78  1449177          161.78
3  STK001  2021-01-07  163.33  167.90  163.33  167.07  1534833          167.07
4  STK001  2021-01-08  168.20  168.20  164.12  165.68   848261          165.68
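
One thing worth checking here: the company table lists STK001 with an `ipo_date` of 2021-04-10, yet its price history starts on 2021-01-04. The sketch below counts price rows dated before the IPO for each ticker. The frames are small stand-ins; only the STK001 and STK002 IPO dates and the first STK001 price date come from the previews, the remaining rows are illustrative.

```python
import pandas as pd

company = pd.DataFrame({
    "ticker": ["STK001", "STK002"],
    "ipo_date": ["2021-04-10", "2016-10-12"],
})
prices = pd.DataFrame({
    "ticker": ["STK001", "STK001", "STK002"],
    "date": ["2021-01-04", "2021-04-12", "2021-01-04"],  # rows 2 and 3 are illustrative
})

company["ipo_date"] = pd.to_datetime(company["ipo_date"])
prices["date"] = pd.to_datetime(prices["date"])

# validate="many_to_one" raises if a ticker appears twice in the company table.
merged = prices.merge(company, on="ticker", how="left", validate="many_to_one")
pre_ipo = merged[merged["date"] < merged["ipo_date"]]
pre_ipo_counts = pre_ipo.groupby("ticker").size()
print(pre_ipo_counts)
```

A non-empty result does not say which side is wrong, only that `ipo_date` and the price history disagree for those tickers.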


In [11]:
print("\n4. Price Statistics (Close Price):")
stock_prices['close'].describe()


4. Price Statistics (Close Price):


count    15600.000000
mean       102.609313
std         58.946467
min         21.440000
25%         63.340000
50%         88.735000
75%        131.160000
max        514.520000
Name: close, dtype: float64
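
These statistics pool all 20 stocks, so the maximum of 514.52 against a 75th percentile of 131.16 may reflect one high-priced ticker rather than an outlier. A per-ticker summary separates the two cases. The sketch uses hypothetical tickers and prices (only 514.52 is taken from the output above).

```python
import pandas as pd

prices = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],  # hypothetical tickers
    "close": [50.0, 52.0, 500.0, 514.52],
})

per_ticker = prices.groupby("ticker")["close"].agg(["min", "median", "max"])
top_ticker = per_ticker["max"].idxmax()  # ticker responsible for the overall maximum

print(per_ticker)
print(f"Highest close belongs to: {top_ticker}")
```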

In [12]:
print("\n5. Data Quality Check:")
missing_by_column = stock_prices.isnull().sum()
print("\n   Missing Values by Column:")
if missing_by_column.sum() > 0:
    print(missing_by_column[missing_by_column > 0])
else:
    print("   No missing values!")

print(f"\n   - Duplicate records: {stock_prices.duplicated().sum()}")


5. Data Quality Check:

   Missing Values by Column:
   No missing values!

   - Duplicate records: 0
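
`duplicated()` without arguments only flags rows that are identical in every column, and the null check says nothing about whether the prices are internally consistent. A sketch of two further checks follows: duplicate `(ticker, date)` keys, and rows where `high` or `low` fail to bracket the other prices. `ohlc_violations` is a hypothetical helper. The first two demo rows come from the preview, the third is deliberately broken.

```python
import pandas as pd


def ohlc_violations(df):
    """Return rows where high/low do not bracket the other prices, or a price is <= 0."""
    price_cols = ["open", "high", "low", "close"]
    bad_high = df["high"] < df[["open", "close", "low"]].max(axis=1)
    bad_low = df["low"] > df[["open", "close", "high"]].min(axis=1)
    non_positive = (df[price_cols] <= 0).any(axis=1)
    return df[bad_high | bad_low | non_positive]


prices = pd.DataFrame({
    "ticker": ["STK001", "STK001", "STK001"],
    "date": ["2021-01-04", "2021-01-08", "2021-01-11"],
    "open": [158.09, 168.20, 100.00],
    "high": [160.97, 168.20, 99.00],  # third row: high below open (illustrative error)
    "low": [158.09, 164.12, 98.00],
    "close": [160.11, 165.68, 98.50],
})

violations = ohlc_violations(prices)
duplicate_keys = int(prices.duplicated(subset=["ticker", "date"]).sum())

print(violations)
print(f"Duplicate (ticker, date) keys: {duplicate_keys}")
```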


---
## Part 4: Dataset 3 - Stock Prices with Technical Indicators (PRIMARY DATASET)

In [13]:

print("DATASET 3: STOCK PRICES WITH INDICATORS (PRIMARY DATASET)")


stock_indicators = dfs['stock_prices_with_indicators']

print("\n1. Basic Information:")
print(f"   - Shape: {stock_indicators.shape}")
print(f"   - Number of stocks: {stock_indicators['ticker'].nunique()}")
print(f"   - Total features: {stock_indicators.shape[1]}")

DATASET 3: STOCK PRICES WITH INDICATORS (PRIMARY DATASET)

1. Basic Information:
   - Shape: (15502, 31)
   - Number of stocks: 20
   - Total features: 31


In [14]:
print("\n2. Column Information:")
stock_indicators.info()


2. Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15502 entries, 0 to 15501
Data columns (total 31 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ticker            15502 non-null  object 
 1   date              15502 non-null  object 
 2   open              15502 non-null  float64
 3   high              15502 non-null  float64
 4   low               15502 non-null  float64
 5   close             15502 non-null  float64
 6   volume            15502 non-null  float64
 7   adjusted_close    15502 non-null  float64
 8   sma_20            15192 non-null  float64
 9   sma_50            15502 non-null  float64
 10  sma_200           15502 non-null  float64
 11  ema_12            15502 non-null  float64
 12  ema_26            15502 non-null  float64
 13  macd              15191 non-null  float64
 14  macd_signal       15502 non-null  float64
 15  macd_histogram    15502 non-null  float64
 16  rsi_14          

In [15]:
print("\n3. First 3 Records (showing key columns):")
key_cols = ['ticker', 'date', 'close', 'rsi_14', 'macd', 'sma_20', 'trend_label']
available_cols = [col for col in key_cols if col in stock_indicators.columns]
stock_indicators[available_cols].head(3)


3. First 3 Records (showing key columns):


   ticker        date   close    rsi_14      macd      sma_20  trend_label
0  STK001  2021-01-04  160.11       NaN  0.000000  160.110000      Uptrend
1  STK001  2021-01-05  162.36  100.0000  0.179487  161.235000     Sideways
2  STK001  2021-01-06  161.78   79.5053  0.271798  161.416667     Sideways
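
Note that `sma_20` already has a value on the first trading day, so the indicators were apparently computed without a full warm-up window. The sketch below is an assumption about how they were built, not the documented method: a rolling mean with `min_periods=1` and EMAs with `adjust=False` reproduce the previewed `sma_20` and `macd` values for the first three STK001 closes.

```python
import pandas as pd

close = pd.Series([160.11, 162.36, 161.78])  # STK001 closes from the preview

# Partial windows are averaged instead of yielding NaN.
sma_20 = close.rolling(window=20, min_periods=1).mean()

# Recursive EMA seeded with the first close (alpha = 2 / (span + 1)).
ema_12 = close.ewm(span=12, adjust=False).mean()
ema_26 = close.ewm(span=26, adjust=False).mean()
macd = ema_12 - ema_26

print(sma_20.round(6).tolist())
print(macd.round(6).tolist())
```

If this is how the file was produced, early values of the 20, 50 and 200 day averages rest on far fewer observations than their names suggest, which matters when they are used as model features.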


In [16]:
print("\n4. Feature Categories:")

print("\n   Price Data:")
price_cols = ['open', 'high', 'low', 'close', 'volume', 'adjusted_close']
print(f"   {[col for col in price_cols if col in stock_indicators.columns]}")

print("\n   Moving Averages:")
ma_cols = [col for col in stock_indicators.columns if 'sma' in col or 'ema' in col]
print(f"   {ma_cols[:8]}")

print("\n   Momentum Indicators:")
momentum_cols = [col for col in stock_indicators.columns if any(x in col for x in ['rsi', 'momentum', 'roc', 'macd'])]
print(f"   {momentum_cols[:8]}")

print("\n   Volatility Indicators:")
volatility_cols = [col for col in stock_indicators.columns if any(x in col for x in ['atr', 'volatility', 'bb_'])]
print(f"   {volatility_cols[:6]}")


4. Feature Categories:

   Price Data:
   ['open', 'high', 'low', 'close', 'volume', 'adjusted_close']

   Moving Averages:
   ['sma_20', 'sma_50', 'sma_200', 'ema_12', 'ema_26', 'volume_sma_20', 'price_to_sma_50']

   Momentum Indicators:
   ['macd', 'macd_signal', 'macd_histogram', 'rsi_14', 'momentum_10', 'momentum_20']

   Volatility Indicators:
   ['bb_middle', 'bb_upper', 'bb_lower', 'bb_width', 'atr_14', 'volatility_20']
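
The substring test `'sma' in col` also places `volume_sma_20` and `price_to_sma_50` under Moving Averages, although neither is a moving average of price. A sketch with anchored regular expressions keeps each group strict and lists whatever is left over. The patterns are my own assumption about the naming scheme, and the column list contains only the names printed above.

```python
import re

columns = [
    "sma_20", "sma_50", "sma_200", "ema_12", "ema_26",
    "volume_sma_20", "price_to_sma_50",
    "macd", "macd_signal", "macd_histogram", "rsi_14", "momentum_10", "momentum_20",
    "bb_middle", "bb_upper", "bb_lower", "bb_width", "atr_14", "volatility_20",
]

# Anchored patterns: the name must start with the indicator prefix.
patterns = {
    "moving_average": r"^(sma|ema)_\d+$",
    "momentum": r"^(rsi|momentum|roc)_\d+$|^macd",
    "volatility": r"^(atr|volatility)_\d+$|^bb_",
}

groups = {
    name: [col for col in columns if re.search(pattern, col)]
    for name, pattern in patterns.items()
}
categorized = {col for cols in groups.values() for col in cols}
uncategorized = [col for col in columns if col not in categorized]

for name, cols in groups.items():
    print(f"{name}: {cols}")
print(f"uncategorized: {uncategorized}")
```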


In [17]:
print("\n5. Target Variable Distribution:")
if 'trend_label' in stock_indicators.columns:
    print(stock_indicators['trend_label'].value_counts())
    print(f"\n   Class distribution (%):")
    print(stock_indicators['trend_label'].value_counts(normalize=True) * 100)
else:
    print("   Target variable 'trend_label' not found in dataset")


5. Target Variable Distribution:
trend_label
Uptrend      5532
Downtrend    5026
Sideways     4944
Name: count, dtype: int64

   Class distribution (%):
trend_label
Uptrend      35.685718
Downtrend    32.421623
Sideways     31.892659
Name: proportion, dtype: float64
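
The three classes are close to balanced. As a reference point for later modelling, the sketch below derives the accuracy of always predicting the most frequent class from the counts printed above. Any classifier should beat this number.

```python
import pandas as pd

# Class counts copied from the output above.
counts = pd.Series({"Uptrend": 5532, "Downtrend": 5026, "Sideways": 4944})

majority_class = counts.idxmax()
baseline_accuracy = counts.max() / counts.sum()  # accuracy of always predicting the majority
imbalance_ratio = counts.max() / counts.min()    # largest class relative to smallest

print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Imbalance ratio (max/min): {imbalance_ratio:.3f}")
```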


In [18]:
print("\n6. Data Quality Check:")
missing_by_column = stock_indicators.isnull().sum()
print("\n   Missing Values by Column:")
if missing_by_column.sum() > 0:
    print(missing_by_column[missing_by_column > 0].sort_values(ascending=False).head(10))
else:
    print("   No missing values!")

print(f"\n   - Duplicate records: {stock_indicators.duplicated().sum()}")


6. Data Quality Check:

   Missing Values by Column:
momentum_20      400
bb_width         331
rsi_14           330
macd             311
sma_20           310
volume_ratio     310
momentum_10      200
volatility_20     40
bb_lower          20
bb_upper          20
dtype: int64

   - Duplicate records: 0
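
Since `sma_20` is populated from the very first row of STK001, its 310 missing values are probably not the usual warm-up gap at the start of each series. The sketch below shows one way to check: for each ticker it counts the missing values and how many of them form an unbroken run at the start. `missing_profile` is a hypothetical helper and the demo data is made up.

```python
import numpy as np
import pandas as pd


def missing_profile(df, col, key="ticker"):
    """Per key: total NaNs in `col` and the length of the leading run of NaNs."""
    rows = {}
    for group_key, series in df.groupby(key)[col]:
        is_na = series.isna().astype(int)
        rows[group_key] = {
            "missing": int(is_na.sum()),
            # cumprod stays 1 only while every value so far has been NaN
            "leading": int(is_na.cumprod().sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


demo = pd.DataFrame({
    "ticker": ["A"] * 4 + ["B"] * 4,  # hypothetical tickers
    "sma_20": [np.nan, np.nan, 1.0, 2.0, 1.0, np.nan, 2.0, np.nan],
})

profile = missing_profile(demo, "sma_20")
print(profile)
```

When `leading` equals `missing`, the gaps are warm-up rows. When `leading` is 0, the gaps are scattered through the series and call for a different treatment.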


---
## Part 5: Dataset 4 - Market Indices

In [19]:

print("DATASET 4: MARKET INDICES")


market_indices = dfs['market_indices']

print("\n1. Basic Information:")
print(f"   - Shape: {market_indices.shape}")
print(f"   - Date range: {market_indices['date'].min()} to {market_indices['date'].max()}")

DATASET 4: MARKET INDICES

1. Basic Information:
   - Shape: (780, 7)
   - Date range: 2021-01-04 to 2023-12-29


In [20]:
print("\n2. Column Information:")
market_indices.info()


2. Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 780 entries, 0 to 779
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           780 non-null    object 
 1   sp500_close    780 non-null    float64
 2   nasdaq_close   780 non-null    float64
 3   vix_close      780 non-null    float64
 4   treasury_10y   780 non-null    float64
 5   dollar_index   780 non-null    float64
 6   market_regime  780 non-null    object 
dtypes: float64(5), object(2)
memory usage: 42.8+ KB


In [21]:
print("\n3. First 5 Records:")
market_indices.head()


3. First 5 Records:


         date  sp500_close  nasdaq_close  vix_close  treasury_10y  dollar_index  market_regime
0  2021-01-04      3891.04      12952.02      20.48         1.496         90.07           bull
1  2021-01-05      3847.20      13116.52      19.89         1.504         90.73           bull
2  2021-01-06      3878.24      12818.48      20.02         1.507         90.46           bull
3  2021-01-07      3803.95      13290.11      19.59         1.498         90.67           bull
4  2021-01-08      3715.64      13365.73      20.14         1.498         90.64           bull


In [22]:
print("\n4. Available Market Indicators:")
for col in market_indices.columns:
    if col != 'date':
        print(f"   - {col}")


4. Available Market Indicators:
   - sp500_close
   - nasdaq_close
   - vix_close
   - treasury_10y
   - dollar_index
   - market_regime


In [23]:
print("\n5. Market Regime Distribution:")
if 'market_regime' in market_indices.columns:
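    # Jupyter only auto-displays the last top-level expression of a cell,
    # so a result inside an `if` block must be printed explicitly.
    print(market_indices['market_regime'].value_counts())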


5. Market Regime Distribution:


In [24]:
print("\n6. Data Quality Check:")
missing_by_column = market_indices.isnull().sum()
print("\n   Missing Values by Column:")
if missing_by_column.sum() > 0:
    print(missing_by_column[missing_by_column > 0])
else:
    print("   No missing values!")


6. Data Quality Check:

   Missing Values by Column:
   No missing values!


---
## Part 6: Cross-Dataset Validation

In [25]:

print("CROSS-DATASET VALIDATION")


print("\n1. Ticker Consistency Check:")
tickers_company = set(company_info['ticker'].unique())
tickers_prices = set(stock_prices['ticker'].unique())
tickers_indicators = set(stock_indicators['ticker'].unique())

print(f"   - Company Info: {len(tickers_company)} tickers")
print(f"   - Stock Prices: {len(tickers_prices)} tickers")
print(f"   - Indicators: {len(tickers_indicators)} tickers")

if tickers_company == tickers_prices == tickers_indicators:
    print("\n   ✓ All datasets have consistent tickers")
else:
    print("\n   ⚠ Ticker mismatch detected between datasets")

CROSS-DATASET VALIDATION

1. Ticker Consistency Check:
   - Company Info: 20 tickers
   - Stock Prices: 20 tickers
   - Indicators: 20 tickers

   ✓ All datasets have consistent tickers
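
The equality check only says whether the ticker sets agree. If they do not, the set differences show which tickers are affected. A short sketch, with `ticker_mismatches` as a hypothetical helper and made-up sets for the demo:

```python
def ticker_mismatches(reference, others):
    """Compare each named ticker set with `reference`; return only the differences."""
    report = {}
    for name, tickers in others.items():
        missing = sorted(reference - tickers)  # in reference, absent from this dataset
        extra = sorted(tickers - reference)    # in this dataset, unknown to reference
        if missing or extra:
            report[name] = {"missing": missing, "extra": extra}
    return report


# Illustrative sets: the indicators set lacks STK003 and has an unknown STK999.
reference = {"STK001", "STK002", "STK003"}
report = ticker_mismatches(
    reference,
    {
        "stock_prices": {"STK001", "STK002", "STK003"},
        "indicators": {"STK001", "STK002", "STK999"},
    },
)
print(report)
```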


In [26]:
print("\n2. Date Range Consistency:")
print(f"   - Stock Prices: {stock_prices['date'].min()} to {stock_prices['date'].max()}")
print(f"   - Indicators: {stock_indicators['date'].min()} to {stock_indicators['date'].max()}")
print(f"   - Market Indices: {market_indices['date'].min()} to {market_indices['date'].max()}")


2. Date Range Consistency:
   - Stock Prices: 2021-01-04 to 2023-12-29
   - Indicators: 2021-01-04 to 2023-12-25
   - Market Indices: 2021-01-04 to 2023-12-29
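
The indicators dataset ends on 2023-12-25, four days before the other two, and has 98 fewer rows than the raw prices (15,502 against 15,600). To see which `(ticker, date)` pairs are absent, a left merge with `indicator=True` works as an anti-join. The frames below are illustrative stand-ins for `stock_prices` and `stock_indicators`.

```python
import pandas as pd

prices = pd.DataFrame({
    "ticker": ["STK001"] * 3,
    "date": ["2023-12-22", "2023-12-25", "2023-12-29"],  # illustrative dates
})
indicators = pd.DataFrame({
    "ticker": ["STK001"] * 2,
    "date": ["2023-12-22", "2023-12-25"],
})

keys = ["ticker", "date"]

# indicator=True adds a `_merge` column; "left_only" marks rows found only in prices.
joined = prices[keys].merge(indicators[keys], on=keys, how="left", indicator=True)
missing_rows = joined.loc[joined["_merge"] == "left_only", keys]

print(missing_rows)
print(missing_rows.groupby("ticker").size())
```

Grouping the missing rows by ticker or by date then shows whether the gap is a trimmed tail shared by all stocks or rows dropped here and there.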


---
## Part 7: Summary and Next Steps

In [27]:

print("DATA LOADING SUMMARY")


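# Derive the message from what was actually loaded instead of hardcoding it
if len(dfs) == len(datasets):
    print(f"\n✓ SUCCESS: All {len(dfs)} datasets loaded successfully!")
else:
    print(f"\n⚠ Only {len(dfs)} of {len(datasets)} datasets loaded")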
print("\nDataset Sizes:")
for name, df in dfs.items():
    print(f"   - {name}: {df.shape[0]:,} rows × {df.shape[1]} columns")

print("\n\nKEY FINDINGS:")
print("1. All datasets loaded without critical errors")
print(f"2. Primary dataset (indicators) has {stock_indicators.shape[1]} features")
if 'trend_label' in stock_indicators.columns:
    print(f"3. Target variable 'trend_label' has {stock_indicators['trend_label'].nunique()} classes")
print(f"4. Date coverage: {stock_indicators['date'].min()} to {stock_indicators['date'].max()}")

print("\n\nNEXT STEPS:")
print("1. Move to Notebook 02: Data Quality Assessment")
print("2. Handle missing values and data type conversions")
print("3. Prepare data for exploratory analysis")


print("NOTEBOOK 1 COMPLETE")


DATA LOADING SUMMARY

✓ SUCCESS: All 4 datasets loaded successfully!

Dataset Sizes:
   - company_info: 20 rows × 4 columns
   - stock_prices: 15,600 rows × 8 columns
   - stock_prices_with_indicators: 15,502 rows × 31 columns
   - market_indices: 780 rows × 7 columns


KEY FINDINGS:
1. All datasets loaded without critical errors
2. Primary dataset (indicators) has 31 features
3. Target variable 'trend_label' has 3 classes
4. Date coverage: 2021-01-04 to 2023-12-25


NEXT STEPS:
1. Move to Notebook 02: Data Quality Assessment
2. Handle missing values and data type conversions
3. Prepare data for exploratory analysis
NOTEBOOK 1 COMPLETE
