# **Train-Test Split**

The train-test split is a common technique for evaluating a machine learning model.
It divides the available data into two disjoint sets: a training set and a testing set.
The model is fit on the training set only, and the testing set is held back to measure how well the fitted model performs.

In practice, the procedure has three steps:
1. Split the data, usually at random, into a training set and a testing set.
2. Fit the model on the training set only.
3. Score the fitted model on the testing set.

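The split matters because the testing set plays the role of unseen data: a model scored on the same rows it was trained on tends to look better than it really is, while a score on held-out rows gives a more honest estimate of how it will perform on new data.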
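
To make the splitting step concrete, here is a minimal plain-Python sketch of a random split. `simple_train_test_split` is a hypothetical helper written for illustration only; it is not how scikit-learn implements `train_test_split`, just the same idea of shuffling the rows and slicing off a test portion.

```python
import random


def simple_train_test_split(rows, test_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle row indices, then slice off a test set."""
    rng = random.Random(seed)          # seeded so the split is repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)

    n_test = int(round(len(rows) * test_fraction))
    test_idx = indices[:n_test]        # first n_test shuffled rows -> test
    train_idx = indices[n_test:]       # the remaining rows -> train

    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test


data = list(range(10))                 # toy stand-in for a dataset
train, test = simple_train_test_split(data, test_fraction=0.2, seed=0)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, which is the property that makes the test score a measurement on data the model has not seen.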


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [7]:
df = pd.read_csv("https://github.com/RyanNolanData/YouTubeData/blob/main/500hits.csv?raw=true", encoding="latin-1")
df.head()

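|  | PLAYER | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | CS | BA | HOF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ty Cobb | 24 | 3035 | 11434 | 2246 | 4189 | 724 | 295 | 117 | 726 | 1249 | 357 | 892 | 178 | 0.366 | 1 |
| 1 | Stan Musial | 22 | 3026 | 10972 | 1949 | 3630 | 725 | 177 | 475 | 1951 | 1599 | 696 | 78 | 31 | 0.331 | 1 |
| 2 | Tris Speaker | 22 | 2789 | 10195 | 1882 | 3514 | 792 | 222 | 117 | 724 | 1381 | 220 | 432 | 129 | 0.345 | 1 |
| 3 | Derek Jeter | 20 | 2747 | 11195 | 1923 | 3465 | 544 | 66 | 260 | 1311 | 1082 | 1840 | 358 | 97 | 0.31 | 1 |
| 4 | Honus Wagner | 21 | 2792 | 10430 | 1736 | 3430 | 640 | 252 | 101 | 0 | 963 | 327 | 722 | 15 | 0.329 | 1 |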


In [9]:
# Features: every column except the player name and the target
X = df.drop(columns=["PLAYER", "HOF"])
# Target: the HOF label
y = df["HOF"]

In [10]:
X.head()

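|  | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | CS | BA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 3035 | 11434 | 2246 | 4189 | 724 | 295 | 117 | 726 | 1249 | 357 | 892 | 178 | 0.366 |
| 1 | 22 | 3026 | 10972 | 1949 | 3630 | 725 | 177 | 475 | 1951 | 1599 | 696 | 78 | 31 | 0.331 |
| 2 | 22 | 2789 | 10195 | 1882 | 3514 | 792 | 222 | 117 | 724 | 1381 | 220 | 432 | 129 | 0.345 |
| 3 | 20 | 2747 | 11195 | 1923 | 3465 | 544 | 66 | 260 | 1311 | 1082 | 1840 | 358 | 97 | 0.31 |
| 4 | 21 | 2792 | 10430 | 1736 | 3430 | 640 | 252 | 101 | 0 | 963 | 327 | 722 | 15 | 0.329 |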


In [12]:
X.shape

(465, 14)

In [13]:
y.shape

(465,)

In [14]:
# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)
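
As a small illustration of what `random_state` does in the call above, the sketch below splits the same toy data twice with the same seed and checks that the test rows match. The toy `X` and `y` lists are made up for this example and are not the 500hits data.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 single-feature rows with alternating labels
X = [[i] for i in range(20)]
y = [i % 2 for i in range(20)]

# Same random_state twice: the shuffle, and therefore the split, is repeated
first = train_test_split(X, y, test_size=0.2, random_state=45)
second = train_test_split(X, y, test_size=0.2, random_state=45)

# Each result is [X_train, X_test, y_train, y_test]; compare the X_test parts
same_split = first[1] == second[1]
print(same_split)
```

Leaving `random_state` unset would generally give a different split on each run, which makes results harder to compare between runs.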

In [17]:
display(X_train.shape)
display(X_test.shape)

(372, 14)

(93, 14)

In [18]:
display(y_train.shape)
display(y_test.shape)

(372,)

(93,)
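
The 372 and 93 row counts follow from `test_size=0.2` applied to the 465 rows. A rough sketch of the arithmetic, assuming a fractional test size is rounded up and the remaining rows go to training (which agrees with the shapes shown above):

```python
import math

n_samples = 465      # rows in X, from X.shape above
test_size = 0.2      # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # fractional test size, rounded up
n_train = n_samples - n_test               # all other rows go to training

print(n_train, n_test)  # 372 93
```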

In [20]:
display(X_train.describe().round(2))
display(y_train.describe().round(3))

|  | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | CS | BA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 | 372.0 |
| mean | 17.1 | 2050.55 | 7508.85 | 1152.03 | 2171.93 | 382.65 | 77.45 | 206.37 | 920.2 | 790.01 | 867.63 | 193.59 | 60.11 | 0.29 |
| std | 2.77 | 355.82 | 1301.63 | 297.92 | 425.26 | 95.98 | 48.28 | 145.65 | 476.29 | 327.31 | 485.97 | 184.43 | 48.79 | 0.02 |
| min | 11.0 | 1331.0 | 4981.0 | 601.0 | 1660.0 | 177.0 | 3.0 | 9.0 | 0.0 | 239.0 | 0.0 | 7.0 | 0.0 | 0.25 |
| 25% | 15.0 | 1801.75 | 6519.75 | 933.25 | 1837.0 | 312.0 | 39.0 | 83.0 | 686.0 | 548.0 | 465.0 | 62.75 | 26.75 | 0.27 |
| 50% | 17.0 | 1992.0 | 7261.5 | 1108.5 | 2085.5 | 366.0 | 67.0 | 187.0 | 978.5 | 736.0 | 848.0 | 135.5 | 55.0 | 0.29 |
| 75% | 19.0 | 2229.5 | 8144.0 | 1287.5 | 2375.0 | 440.0 | 105.5 | 299.25 | 1207.5 | 949.5 | 1237.25 | 275.0 | 86.25 | 0.3 |
| max | 26.0 | 3308.0 | 12364.0 | 2295.0 | 4189.0 | 792.0 | 295.0 | 755.0 | 2297.0 | 2190.0 | 2597.0 | 1406.0 | 335.0 | 0.37 |


count    372.000
mean       0.347
std        0.482
min        0.000
25%        0.000
50%        0.000
75%        1.000
max        2.000
Name: HOF, dtype: float64

In [21]:
display(X_test.describe().round(2))
display(y_test.describe().round(3))

|  | YRS | G | AB | R | H | 2B | 3B | HR | RBI | BB | SO | SB | CS | BA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 | 93.0 |
| mean | 16.84 | 2041.3 | 7521.86 | 1143.44 | 2163.52 | 374.15 | 82.98 | 179.75 | 790.48 | 757.78 | 766.83 | 205.17 | 49.98 | 0.29 |
| std | 2.76 | 350.44 | 1270.22 | 255.15 | 422.1 | 98.69 | 53.52 | 133.85 | 513.6 | 328.43 | 496.5 | 171.75 | 44.18 | 0.02 |
| min | 12.0 | 1437.0 | 5603.0 | 651.0 | 1661.0 | 182.0 | 11.0 | 16.0 | 0.0 | 276.0 | 14.0 | 14.0 | 0.0 | 0.25 |
| 25% | 15.0 | 1802.0 | 6527.0 | 954.0 | 1839.0 | 311.0 | 47.0 | 67.0 | 527.0 | 510.0 | 327.0 | 67.0 | 15.0 | 0.27 |
| 50% | 17.0 | 1993.0 | 7147.0 | 1083.0 | 2061.0 | 368.0 | 69.0 | 144.0 | 885.0 | 719.0 | 741.0 | 167.0 | 41.0 | 0.29 |
| 75% | 19.0 | 2323.0 | 8498.0 | 1320.0 | 2383.0 | 425.0 | 109.0 | 264.0 | 1195.0 | 966.0 | 1125.0 | 302.0 | 79.0 | 0.3 |
| max | 25.0 | 2968.0 | 11008.0 | 1774.0 | 3430.0 | 657.0 | 309.0 | 548.0 | 1704.0 | 1865.0 | 2003.0 | 739.0 | 178.0 | 0.34 |


count    93.000
mean      0.258
std       0.440
min       0.000
25%       0.000
50%       0.000
75%       1.000
max       1.000
Name: HOF, dtype: float64
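
The two label summaries differ noticeably: the mean of `HOF` is 0.347 in the training set and 0.258 in the testing set, so a purely random split did not preserve the label mix. One option worth trying is the `stratify` argument of `train_test_split`, which keeps label proportions roughly equal across the two sets. The sketch below uses made-up toy labels rather than the 500hits data. Whether it works unchanged on `HOF` is an assumption, since stratification needs at least two rows for every label value.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 70 rows labelled 0 and 30 rows labelled 1
X = [[i] for i in range(100)]
y = [0] * 70 + [1] * 30

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=45, stratify=y
)

# With stratify=y, both splits keep the 70/30 label ratio
print(Counter(y_train))
print(Counter(y_test))
```

On the real data, comparing `y_train.value_counts()` with `y_test.value_counts()` before and after adding `stratify=y` would show whether the label mix has evened out.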