# Linear Regression

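You have been contracted by Hyundai Heavy Industries to build a predictive model for some of its ships. Hyundai Heavy Industries is one of the world's largest shipbuilders and builds cruise liners.
You have arrived at its headquarters in Ulsan to help the company accurately estimate how many crew members a ship needs.
The company is currently building new ships and wants you to create a predictive model and use it to predict how many crew members those ships will require.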

The data available so far is described below.

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
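The data above is stored in a CSV file named "cruise_ship_info.csv". Your task is to build a regression model that helps predict how many crew members future ships will need. The client also mentioned that cruise lines differ in the crew counts they consider acceptable, so the cruise line is an important feature to include in the analysis!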

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

In [2]:
ship = pd.read_csv("./data/cruise_ship_info_example.csv")
ship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         158 non-null    int64  
 1   Ship_name          158 non-null    object 
 2   Cruise_line        158 non-null    object 
 3   Age                158 non-null    int64  
 4   Tonnage            158 non-null    float64
 5   passengers         158 non-null    float64
 6   length             158 non-null    float64
 7   cabins             158 non-null    float64
 8   passenger_density  158 non-null    float64
 9   crew               110 non-null    float64
dtypes: float64(6), int64(2), object(2)
memory usage: 12.5+ KB


In [3]:
ohe = OneHotEncoder()

In [4]:
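# One-hot encode the ship names (the encoder expects a 2-D array).
line_arr = np.array(ship["Ship_name"])
line_arr = np.reshape(line_arr, (-1, 1))
line = ohe.fit_transform(line_arr)  # sparse matrix, one column per distinct name
line_df = pd.DataFrame(line.toarray(), columns=ohe.get_feature_names_out())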

In [5]:
ship = pd.concat([ship,line_df], axis=1)

In [6]:
ship.head()

Unnamed: 0.1,Unnamed: 0,Ship_name,Cruise_line,Age,Tonnage,passengers,length,cabins,passenger_density,crew,...,x0_Volendam,x0_Voyager,x0_Westerdam,x0_Whisper,x0_Wind,x0_Wonder,x0_Xpedition,x0_Zaandam,x0_Zenith,x0_Zuiderdam
0,0,Journey,Azamara,6,30.277,6.94,5.94,3.55,42.64,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Quest,Azamara,6,30.277,6.94,5.94,3.55,42.64,3.55,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Celebration,Carnival,26,47.262,14.86,7.22,7.43,31.8,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Conquest,Carnival,11,110.0,29.74,9.53,14.88,36.99,19.1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Destiny,Carnival,17,101.353,26.42,8.92,13.21,38.36,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
# Drop the first three columns (Unnamed: 0, Ship_name, Cruise_line).
ship = ship.iloc[:, 3:]

In [8]:
ship.head()

Unnamed: 0,Age,Tonnage,passengers,length,cabins,passenger_density,crew,x0_Adventure,x0_Allegra,x0_Amsterdam,...,x0_Volendam,x0_Voyager,x0_Westerdam,x0_Whisper,x0_Wind,x0_Wonder,x0_Xpedition,x0_Zaandam,x0_Zenith,x0_Zuiderdam
0,6,30.277,6.94,5.94,3.55,42.64,,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,6,30.277,6.94,5.94,3.55,42.64,3.55,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,26,47.262,14.86,7.22,7.43,31.8,,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,11,110.0,29.74,9.53,14.88,36.99,19.1,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,17,101.353,26.42,8.92,13.21,38.36,10.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
ship.describe()

Unnamed: 0,Age,Tonnage,passengers,length,cabins,passenger_density,crew,x0_Adventure,x0_Allegra,x0_Amsterdam,...,x0_Volendam,x0_Voyager,x0_Westerdam,x0_Whisper,x0_Wind,x0_Wonder,x0_Xpedition,x0_Zaandam,x0_Zenith,x0_Zuiderdam
count,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,...,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0
mean,16.127273,70.199473,18.150455,8.065545,8.751,39.712636,7.728909,0.009091,0.009091,0.0,...,0.009091,0.0,0.0,0.009091,0.009091,0.0,0.009091,0.009091,0.009091,0.0
std,8.045865,37.41013,9.643208,1.843385,4.43837,8.337648,3.563549,0.095346,0.095346,0.0,...,0.095346,0.0,0.0,0.095346,0.095346,0.0,0.095346,0.095346,0.095346,0.0
min,5.0,2.329,0.66,2.79,0.33,17.7,0.59,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,10.25,39.0,10.975,6.875,5.285,34.6125,5.2,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,14.0,71.899,19.5,8.55,9.75,39.085,8.63,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,21.0,91.0,24.845,9.51,11.31,44.005,10.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,48.0,160.0,43.7,11.32,18.17,67.35,19.1,1.0,1.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0


In [9]:
# Keep only the ships with a known crew value (110 of the 158 rows).
ship = ship[ship["crew"].notna()]

In [10]:
x = ship.drop("crew", axis=1)
y = ship["crew"]

# Default split is 75% train / 25% test; random_state makes it reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

In [11]:
x_train.shape, x_test.shape

((82, 144), (28, 144))
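
The 82/28 row counts follow from the `train_test_split` default of holding out 25% of the rows, with the test size rounded up. A small check on a stand-in array of 110 rows (the names `rows`, `train_rows`, and `test_rows` are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 110 ships that have a known crew value.
rows = np.arange(110).reshape(-1, 1)

# With no test_size given, 25% of the rows are held out, rounded up to 28.
train_rows, test_rows = train_test_split(rows, random_state=0)
print(train_rows.shape[0], test_rows.shape[0])
```

The 144 columns are the 6 numeric features plus 138 ship-name indicator columns, so the training set has more features than rows. In that situation ordinary least squares can typically fit the training data exactly, which is worth keeping in mind when reading any test error.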

In [None]:
ms = MinMaxScaler()
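
The notebook stops after creating the scaler. One plausible continuation, sketched here on synthetic data rather than the ship data, is to fit the scaler on the training rows only, fit a `LinearRegression`, and report the test error. The data, coefficients, and variable names below are assumptions made for the example; in the notebook the inputs would be `x_train`, `x_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 110 rows, 3 features, a linear target plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(110, 3))
y = X @ np.array([0.05, 0.02, 0.01]) + rng.normal(0, 0.1, size=110)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Root mean squared error on the held-out rows, in the same units as y.
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_scaled)))
print(f"test RMSE: {rmse:.3f}")
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step. For plain least squares, min-max scaling changes the coefficients but not the predictions, so its main benefit here is making coefficient magnitudes comparable across features.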