## KBO Batting Dataset - Data Analysis and Model Building Report

## 1. Introduction 

The purpose of this report is to analyze the KBO (Korean Baseball Organization) Batting dataset and build a machine learning model using DBSCAN clustering. The report covers dataset exploration, preprocessing, missing value handling, model building, evaluation, and conclusions.

## 2. Libraries Used

The following Python libraries were used in this project:

- **Pandas**: For data manipulation and analysis

- **NumPy**: For numerical computations

- **Matplotlib & Seaborn**: For data visualization

- **Scikit-learn**: For machine learning and clustering (DBSCAN)

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

## 3. Dataset Overview

The dataset consists of team-level batting statistics from the Korean Baseball Organization. Each row holds one team's totals for one season (1982 to 2021) and includes performance metrics such as hits, runs, home runs, and batting average.

### 3.1 Number of Rows and Columns

In [15]:
df = pd.read_csv(r"C:\Users\Shaik Sakhlaih\Downloads\kbobattingdata.csv")
print(df.shape)

(323, 27)


### 3.2 Column Names

In [16]:
print(df.columns)

Index(['year', 'team', 'average_batter_age', 'runs_per_game', 'games',
       'plate_appearances', 'at_bats', 'runs', 'hits', 'doubles', 'triples',
       'homeruns', 'RBI', 'stolen_bases', 'caught_stealing', 'bases_on_balls',
       'strikeouts', 'batting_average', 'OBP', 'SLG', 'OPS', 'total_bases',
       'GDP', 'HBP', 'sacrifice_hits', 'sacrifice_flies', 'IBB'],
      dtype='object')


## 4. Exploratory Data Analysis (EDA)

### 4.1 Displaying First Few Rows

In [17]:
print(df.head())

   year           team  average_batter_age  runs_per_game  games  \
0  2021    SSG Landers                30.9           5.26    143   
1  2021   Doosan Bears                29.0           5.13    143   
2  2021   Lotte Giants                29.0           5.06    143   
3  2021  Kiwoom Heroes                27.1           5.01    143   
4  2021         KT Wiz                29.4           4.97    143   

   plate_appearances  at_bats  runs  hits  doubles  ...  batting_average  \
0               5698     4864   752  1268      203  ...            0.261   
1               5606     4867   733  1306      234  ...            0.268   
2               5689     4978   723  1384      263  ...            0.278   
3               5610     4839   716  1250      243  ...            0.258   
4               5581     4773   711  1263      217  ...            0.265   

     OBP    SLG    OPS  total_bases  GDP  HBP  sacrifice_hits  \
0  0.354  0.421  0.775         2049  104   93              55   
1  0

### 4.2 Summary Statistics

In [18]:
print(df.describe())

              year  average_batter_age  runs_per_game       games  \
count   323.000000          323.000000      323.00000  323.000000   
mean   2002.944272           27.785759        4.61161  128.142415   
std      11.501957            1.335393        0.73503   12.996350   
min    1982.000000           24.300000        2.71000   80.000000   
25%    1993.000000           26.800000        4.00500  126.000000   
50%    2003.000000           27.900000        4.59000  128.000000   
75%    2013.000000           28.900000        5.15500  133.000000   
max    2021.000000           30.900000        6.57000  144.000000   

       plate_appearances      at_bats        runs         hits     doubles  \
count         323.000000   323.000000  323.000000   323.000000  323.000000   
mean         4919.773994  4293.430341  595.095975  1146.535604  197.798762   
std           585.491296   508.741014  132.387325   179.164714   39.628473   
min          2953.000000  2628.000000  302.000000   637.000000  11

### 4.3 Checking Data Types

In [19]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   year                323 non-null    int64  
 1   team                323 non-null    object 
 2   average_batter_age  323 non-null    float64
 3   runs_per_game       323 non-null    float64
 4   games               323 non-null    int64  
 5   plate_appearances   323 non-null    int64  
 6   at_bats             323 non-null    int64  
 7   runs                323 non-null    int64  
 8   hits                323 non-null    int64  
 9   doubles             323 non-null    int64  
 10  triples             323 non-null    int64  
 11  homeruns            323 non-null    int64  
 12  RBI                 323 non-null    int64  
 13  stolen_bases        184 non-null    float64
 14  caught_stealing     184 non-null    float64
 15  bases_on_balls      323 non-null    int64  
 16  strikeou

## 5. Handling Missing Values

### 5.1 Checking for Null Values

In [21]:
print(df.isnull().sum())

year                    0
team                    0
average_batter_age      0
runs_per_game           0
games                   0
plate_appearances       0
at_bats                 0
runs                    0
hits                    0
doubles                 0
triples                 0
homeruns                0
RBI                     0
stolen_bases          139
caught_stealing       139
bases_on_balls          0
strikeouts              0
batting_average         0
OBP                     0
SLG                     0
OPS                     0
total_bases             0
GDP                     0
HBP                     0
sacrifice_hits          0
sacrifice_flies         0
IBB                   139
dtype: int64
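
The counts above show that `stolen_bases`, `caught_stealing`, and `IBB` are each missing in 139 of 323 rows, roughly 43%. A share per column is often easier to read than a raw count. The sketch below shows one way to compute it; the `toy` frame and its values are made up for illustration and are not taken from the KBO data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; all values are invented for illustration.
toy = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "hits": [1200, 1100, 1300, 1250],
    "stolen_bases": [np.nan, 95.0, np.nan, 110.0],
})

# isnull() gives a boolean frame; its column mean is the fraction of missing values.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same expression would be `df.isnull().mean()`.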


### 5.2 Replacing Missing Values

Only `stolen_bases`, `caught_stealing`, and `IBB` have missing values. Missing values can be handled by:

- Filling numerical columns with their **mean/median**

- Filling categorical columns with **mode**

In [None]:
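# Fill numeric columns with their column means; numeric_only skips the text column 'team'
df.fillna(df.mean(numeric_only=True), inplace=True)
# Fill any remaining categorical gaps with the column mode ('team' has no nulls, so this is a safeguard)
df.fillna(df.mode().iloc[0], inplace=True)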
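
As an alternative to the mean, the median option listed above can be sketched as follows. This is a minimal example on a made-up `toy` frame (hypothetical values, not KBO data), restricted to numeric columns so that the text column is left untouched.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; all values are invented for illustration.
toy = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "stolen_bases": [np.nan, 90.0, 110.0, np.nan],
    "hits": [1200, 1100, 1300, 1250],
})

# Select only the numeric columns, then fill each gap with that column's median.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
print(toy)
```

The median is less sensitive to extreme seasons than the mean. With around 43% of values missing in the affected columns, dropping those columns before clustering may also be worth considering.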

## 6. Data Preprocessing

### 6.1 Standardizing the Data

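DBSCAN is distance-based, so features on large scales (for example, `plate_appearances` in the thousands versus `batting_average` below 1) would dominate the distance calculation. We therefore standardize every numeric column with `StandardScaler`: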

In [None]:
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.select_dtypes(include=[np.number]))
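
A quick sanity check on the scaling step is to confirm that each column ends up with mean 0 and standard deviation 1. The sketch below does this on a small made-up array `X` (hypothetical values chosen only to mimic two columns on very different scales).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two invented columns on very different scales.
X = np.array([
    [4800.0, 0.261],
    [4900.0, 0.268],
    [5000.0, 0.278],
    [4700.0, 0.258],
])

X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column should have mean ~0 and (population) std ~1.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```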

## 7. Model Building

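We apply DBSCAN clustering to identify groups of similar team seasons. The values `eps=0.5` and `min_samples=5` are scikit-learn's defaults and will likely need tuning for this many standardized features.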

In [None]:
dbscan = DBSCAN(eps=0.5, min_samples=5)
df['cluster'] = dbscan.fit_predict(df_scaled)
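
One common heuristic for choosing `eps` is the k-distance curve: for every point, compute the distance to its k-th nearest neighbour, sort those distances, and look for the "elbow" where they start to rise sharply. The sketch below is an illustration of that heuristic, not part of the original analysis; `k_distances` is a hypothetical helper and `X_demo` is random data standing in for `df_scaled`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, min_samples):
    """Sorted distance from each point to the farthest of its min_samples nearest points.

    kneighbors(X) on the training data returns each point as its own first
    neighbour (distance 0), which matches scikit-learn's DBSCAN, where
    min_samples also counts the point itself.
    """
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

# Random stand-in for df_scaled (hypothetical data, not the KBO dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

kd = k_distances(X_demo, min_samples=5)
# Smallest, median, and largest k-distance give a rough range for eps.
print(kd[[0, len(kd) // 2, -1]])
```

Plotting `kd` with `plt.plot(kd)` and reading off the elbow gives a candidate `eps`; an `eps` far below the smallest k-distance would label every point as noise.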

### 7.1 Visualizing Clusters

In [None]:
plt.scatter(df_scaled[:, 0], df_scaled[:, 1], c=df['cluster'], cmap='viridis')
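# Columns 0 and 1 of df_scaled are the first two numeric columns: year and average_batter_age
plt.xlabel("year (standardized)")
plt.ylabel("average_batter_age (standardized)")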
plt.title("DBSCAN Clustering of KBO Batting Data")
plt.show()
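
Plotting only two of the scaled columns hides most of the feature space. One option, shown here as a sketch rather than as the method used in this report, is to project all features onto two principal components with PCA before plotting. `X_demo` and `labels_demo` are random stand-ins for `df_scaled` and `df['cluster']`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random stand-ins (hypothetical data, not the KBO dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))            # plays the role of df_scaled
labels_demo = rng.integers(-1, 3, size=100)   # plays the role of df['cluster']

# Project the features onto the two directions of largest variance.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_demo)

plt.scatter(coords[:, 0], coords[:, 1], c=labels_demo, cmap="viridis")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("Cluster labels in PCA space (demo data)")
plt.show()
```

`pca.explained_variance_ratio_` reports how much of the total variance the two components retain, which indicates how faithful the 2D picture is.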

## 8. Model Evaluation

As an unsupervised method, DBSCAN has no ground-truth labels to compute accuracy against. A first check of the result is based on:

- **Number of clusters formed**

- **Noise points (label = -1)**

In [None]:
print(df['cluster'].value_counts())
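
Beyond raw counts, the silhouette score is one internal metric that can be computed without ground-truth labels. The sketch below reports the number of clusters, the number of noise points, and the silhouette score over non-noise points. `summarize_dbscan` is a hypothetical helper, and `X_demo` is synthetic data (two tight blobs plus one outlier), not the KBO dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def summarize_dbscan(X, labels):
    """Return (n_clusters, n_noise, silhouette) for a DBSCAN labelling."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    # Label -1 marks noise and is not a cluster.
    n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    # The silhouette score needs at least two clusters; noise points are excluded.
    mask = labels != -1
    score = silhouette_score(X[mask], labels[mask]) if n_clusters >= 2 else None
    return n_clusters, n_noise, score

# Synthetic demo: two tight blobs and one far-away outlier.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(30, 2)),
    [[20.0, 20.0]],
])
labels_demo = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

n_clusters, n_noise, score = summarize_dbscan(X_demo, labels_demo)
print(n_clusters, n_noise, score)
```

On the real data the call would be `summarize_dbscan(df_scaled, df['cluster'])`. A result with zero clusters or almost all points marked as noise suggests that `eps` is too small for the feature space.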

## 9. Conclusion

- The dataset (323 team seasons, 27 columns) was explored and preprocessed.

- Missing values in `stolen_bases`, `caught_stealing`, and `IBB` were filled with column means.

- DBSCAN clustering was applied to group team seasons based on their batting statistics.

- DBSCAN assigns each row either to a cluster or to noise (label -1); how many of each appear depends on `eps` and `min_samples`.

- Future improvements could include tuning `eps` and `min_samples` and selecting a smaller set of features.

This report provides a detailed overview of the analysis and model-building process using DBSCAN on the KBO Batting dataset.

