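# **1. Data Collection**
- url : https://dacon.io/competitions/open/235536/data
- Data on 600 Korean films released in Korea in the 2010s, including director, title, rating, and audience count
>- title : film title
>- distributor : distributor
>- genre : genre
>- release_time : release date
>- time : running time (minutes)
>- screening_rat : rating (age classification)
>- director : director's name
>- dir_prev_bfnum : average audience count of the films the director worked on before this one (films with unknown audience counts excluded)
>- dir_prev_num : number of films the director worked on before this one (films with unknown audience counts excluded)
>- num_staff : number of staff
>- num_actor : number of lead actors
>- box_off_num : audience count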

In [1]:
from google.colab import drive
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

In [2]:
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
try:
  df = pd.read_csv("/content/sample_data/movies_train.csv")
  df_test=pd.read_csv('/content/sample_data/movies_test.csv')
except FileNotFoundError:
  # fall back to the copy stored on Google Drive
  df = pd.read_csv("/content/drive/MyDrive/기계학습 기말고사 프로젝트/data/movies_train.csv")
  df_test = pd.read_csv("/content/drive/MyDrive/기계학습 기말고사 프로젝트/data/movies_test.csv")
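
The two-location fallback above can also be expressed without `try`/`except`. This is a minimal sketch, assuming a hypothetical helper `first_existing` that is not part of the original notebook; the temporary directory only stands in for the real Colab and Drive paths.

```python
import tempfile
from pathlib import Path


def first_existing(*candidates):
    """Return the first candidate path that points to an existing file (hypothetical helper)."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    raise FileNotFoundError(f"none of the candidate paths exist: {candidates}")


# Demo with a temporary directory standing in for the real locations
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "movies_train.csv"
    real.write_text("title,box_off_num\n")
    found = first_existing(Path(tmp) / "missing.csv", real)
    found_name = found.name
```

In the notebook this would be used as `pd.read_csv(first_existing(local_path, drive_path))`, which keeps the load on one line per file.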


---

In [5]:
print("Full data size:", df.shape)

Full data size: (600, 12)


In [6]:
# The test data has no audience-count column, which is the value we need to predict
set(df.columns)-set(df_test.columns)

{'box_off_num'}

In [7]:
print("Training data size:", df.shape)

Training data size: (600, 12)


# **2. Data Analysis and Exploration**

In [8]:
# Separate categorical and numeric variables
cat_features=[]
num_features=[]
for column, i in zip(df.columns,df.dtypes):
  if i==object:
    cat_features.append(column)
  else:
    num_features.append(column)

## 2-1) Exploring the numeric variables

In [9]:
num_features

['time',
 'dir_prev_bfnum',
 'dir_prev_num',
 'num_staff',
 'num_actor',
 'box_off_num']

In [10]:
df.isna().sum() 

title               0
distributor         0
genre               0
release_time        0
time                0
screening_rat       0
director            0
dir_prev_bfnum    330
dir_prev_num        0
num_staff           0
num_actor           0
box_off_num         0
dtype: int64

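> <Missing-value handling><br>
> dir_prev_bfnum (average audience count of the films the director worked on before this one), missing in 330 of 600 rows<br> => Its correlation with the audience count is about 0.3, which is not negligible, but it varies widely between directors and we judged it to be a value we could not obtain and enter ourselves, so the column is dropped <br>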
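
The decision above rests on two numbers: how much of the column is missing and how strongly the available values correlate with the target. The sketch below shows one way to compute both, assuming a hypothetical helper `missing_and_corr` and toy values (not the competition data).

```python
import numpy as np
import pandas as pd


def missing_and_corr(frame: pd.DataFrame, feature: str, target: str) -> dict:
    """Missing share of `feature` and its Pearson correlation with `target`
    computed only on rows where both values are present (hypothetical helper)."""
    present = frame[[feature, target]].dropna()
    return {
        "missing_ratio": float(frame[feature].isna().mean()),
        "corr": float(present[feature].corr(present[target])),
        "n_used": len(present),
    }


# Toy data for illustration only
toy = pd.DataFrame({
    "dir_prev_bfnum": [np.nan, 1000.0, np.nan, 4000.0, 2500.0, np.nan],
    "box_off_num": [500, 1200, 300, 5000, 2600, 100],
})
summary = missing_and_corr(toy, "dir_prev_bfnum", "box_off_num")
print(summary)
```

On the real data the call would be `missing_and_corr(df, "dir_prev_bfnum", "box_off_num")`, run before the column is dropped.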

In [11]:
df.drop('dir_prev_bfnum',axis=1,inplace=True)

In [12]:
df.isna().sum() 

title            0
distributor      0
genre            0
release_time     0
time             0
screening_rat    0
director         0
dir_prev_num     0
num_staff        0
num_actor        0
box_off_num      0
dtype: int64

In [13]:
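fig = px.imshow(df.select_dtypes("number").corr())  # correlate the numeric columns only
fig.update_layout(title="Correlation between numeric variables<br><sup>For box_off_num (audience count), only num_staff (staff count, 0.54) and time (running time, 0.44) show a moderately meaningful correlation</sup>", template="simple_white")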
fig.show()

> Further exploration of num_staff (staff count, 0.54) and time (running time, 0.44)
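
The coefficients quoted above can also be read as a ranked list instead of from the heatmap. A minimal sketch, assuming a hypothetical helper `corr_with_target` and toy values chosen so the result is easy to verify by hand:

```python
import pandas as pd


def corr_with_target(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with `target`, strongest first
    (hypothetical helper; non-numeric columns are ignored)."""
    numeric = frame.select_dtypes("number")
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "genre": ["a", "b", "c", "d"],   # non-numeric, skipped
    "num_staff": [1, 2, 3, 4],       # perfectly aligned with the target
    "time": [1, 3, 2, 4],            # partially aligned (r = 0.8)
    "num_actor": [4, 3, 2, 1],       # perfectly reversed
    "box_off_num": [1, 2, 3, 4],
})
ranked = corr_with_target(toy, "box_off_num")
print(ranked)
```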

In [14]:
fig = px.scatter(df, x="num_staff", y="box_off_num",trendline="ols", trendline_color_override="red")
fig.update_layout(title="Staff count X audience count (correlation: 0.54)<br><sup>Films with few staff rarely reach a large audience</sup>", template="simple_white")
fig.show()





In [15]:
fig = px.scatter(df, x="time", y="box_off_num", trendline="ols", trendline_color_override="red")
fig.update_layout(title="Running time X audience count (correlation: 0.44)<br><sup>The largest audiences fall in the 120-140 minute range. Few films run longer than 140 minutes</sup>", template="simple_white")
fig.update_xaxes(title="Running time (minutes)")
fig.show()

## 2-2) Exploring the categorical variables

In [16]:
cat_features

['title', 'distributor', 'genre', 'release_time', 'screening_rat', 'director']

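## (1) title variable
- We suspected the audience count might vary with the number of characters in the title, so we examined the relationship between the two
- Titles of 1 character, or of 19 or more characters, appear to have a lower average audience count, but there are too few samples to trust this
- There is also no noticeable correlation between title length and audience count
- We conclude that title length has little effect on the audience count and plan to remove it from the final data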
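
Because the caveat above is about sample size, it helps to look at the per-length count next to the per-length mean. A minimal sketch on toy data (the titles and audience counts are made up for illustration):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "title": ["A", "BB", "CC", "DDD", "EEE"],
    "box_off_num": [100, 300, 500, 200, 400],
})
toy["title_length"] = toy["title"].str.len()

# Mean audience count and number of films for each title length
by_length = toy.groupby("title_length")["box_off_num"].agg(["mean", "count"])
print(by_length)
```

A mean backed by only one or two films, as in the first row here, says little about that title length.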

In [17]:
df["title_length"] = df["title"].str.len()

In [18]:
fig = px.scatter(df, x="title_length", y="box_off_num")
fig.update_layout(title="Title length (characters) X audience count", template="simple_white")
fig.show()

In [19]:
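# select the target column before averaging so that non-numeric columns are not aggregated
df_title_length = df.groupby("title_length")["box_off_num"].mean().reset_index()
fig = px.bar(df_title_length, x="title_length", y="box_off_num")
fig.update_layout(title="Title length (characters) X audience count (mean)", template="simple_white")
fig.show()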

In [20]:
# drop the 'title' column
df.drop('title',axis=1,inplace=True)

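## (2) distributor variable
- The average audience count differs by distributor -> the distributor is informative for predicting the audience count
- The variable needs encoding, and we want to bin distributors by audience count
- The bins follow the distribution of each distributor's average audience count
  - Above 1M (one million), one bin per 1M (everything above 3M shares the top bin)
  - At 1M and below, the three bins below
    - (1M >= x > 0.5M)
    - (0.5M >= x > 0.1M)
    - (0.1M >= x)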
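
The binning rule above can be sketched with `pd.cut`. This is an illustration, not the notebook's own implementation: the helper name `encode_by_mean_audience` is hypothetical, and the boundaries (0.1M, 0.5M, 1M, 2M, 3M, with each upper edge belonging to the lower bin) are taken from the encoding loop later in this section.

```python
import numpy as np
import pandas as pd


def encode_by_mean_audience(mean_audience: pd.Series) -> pd.Series:
    """Map average audience counts to ranks 0 (largest) .. 5 (smallest).
    Hypothetical helper; intervals are right-closed, e.g. (2M, 3M] -> 1."""
    bins = [-np.inf, 100_000, 500_000, 1_000_000, 2_000_000, 3_000_000, np.inf]
    labels = [5, 4, 3, 2, 1, 0]
    return pd.cut(mean_audience, bins=bins, labels=labels).astype(int)


# Illustrative values covering every bin and two exact boundaries
means = pd.Series([3332954.0, 2541603.0, 1939060.0, 700000.0,
                   200000.0, 5000.0, 3000000.0, 100000.0])
ranks = encode_by_mean_audience(means)
print(ranks.tolist())
```

Compared with nested `if`/`elif` branches, keeping the edges in one list makes it easier to see and change the bin boundaries.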

In [21]:
dataset = df.copy()

In [22]:
print("Distributors:", len(dataset["distributor"].unique()))
print(dataset["distributor"].unique())

Distributors: 169
['롯데엔터테인먼트' '(주)쇼박스' '(주)NEW' '쇼박스(주)미디어플렉스' '백두대간' '유니버설픽쳐스인터내셔널코리아'
 '(주)두타연' '(주) 케이알씨지' '(주)콘텐츠 윙' '(주)키노아이' '(주)팝 파트너스' 'CJ E&M 영화부문'
 '(주) 영화제작전원사' 'CJ E&M Pictures' 'CGV 무비꼴라쥬' '리틀빅픽처스' '스폰지' 'CJ 엔터테인먼트'
 'CGV아트하우스' '조이앤시네마' '인디플러그' '콘텐츠판다' '인디스토리' '(주)팝엔터테인먼트' '시네마서비스' '웃기씨네'
 '영화사 진진' '(주)레인보우 팩토리' '김기덕 필름' 'NEW' 'CJ CGV' '동국대학교 충무로영상제작센터'
 'BoXoo 엔터테인먼트' '(주)마운틴픽쳐스' 'CGV 아트하우스' '메가박스(주)플러스엠' '골든타이드픽처스' '파이오니아21'
 '디 씨드' '드림팩트 엔터테인먼트' '시너지' '디마엔터테인먼트' '판다미디어' '(주)스톰픽쳐스코리아'
 '(주)예지림 엔터테인먼트' '(주) 영화사조제' '보람엔터테인먼트' '(주)시네마달' '노바엔터테인먼트' '(주)패스파인더씨앤씨'
 '(주)대명문화공장' '(주)온비즈넷' 'KT&G 상상마당' '무비꼴라쥬' '인벤트 디' '씨네그루(주)키다리이엔티'
 '스튜디오후크' '시네마 달' '나이너스엔터테인먼트(주)' 'THE 픽쳐스' '영구아트무비' '리틀빅픽쳐스' '어뮤즈'
 '이모션 픽처스' '(주)이스트스카이필름' '필라멘트 픽쳐스' '조이앤컨텐츠그룹' '타임스토리그룹' '마운틴 픽처스'
 '(주)휘엔터테인먼트' '이십세기폭스코리아(주)' '(주)피터팬픽쳐스' '에스와이코마드' '(주)더픽쳐스' '오퍼스픽쳐스'
 '(주)고앤고 필름' '사람과 사람들' '(주)JK필름' '씨너스엔터테인먼트(주)' 'KT' '싸이더스FNH' '(주)프레인글로벌'
 '나우콘텐츠' '홀리가든' '(주) 브릿지웍스' '(주)엣나인필름' '위더스필름' '시네마달' '(주)에이원 엔터테인먼트'
 'OAL(올)' '싸

In [23]:
# Some distributors appear under several different names, so unify them

In [24]:
dataset["distributor"] = dataset["distributor"].apply(lambda x: "CGV" if "CGV" in x else x)
dataset["distributor"] = dataset["distributor"].apply(lambda x: "쇼박스" if "쇼박스" in x else x)
dataset["distributor"] = dataset["distributor"].apply(lambda x: "CJ엔터테인먼트" if "CJ" in x else x)
dataset["distributor"] = dataset["distributor"].apply(lambda x: x.replace("(주)", ""))
dataset["distributor"] = dataset["distributor"].apply(lambda x: x.replace(" ", ""))

In [25]:
dataset["distributor"] = dataset["distributor"].apply(lambda x: x.replace("CGV무비꼴라쥬", "CGV"))
dataset["distributor"] = dataset["distributor"].apply(lambda x: x.replace("무비꼴라쥬", "CGV"))
dataset["distributor"] = dataset["distributor"].apply(lambda x: x.replace("CGV아트하우스", "CGV"))
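
The chained `apply` calls above can be collected into one function, which makes the order of the rules explicit. A minimal sketch, assuming a hypothetical helper `normalize_distributor`; the last rule (treating 픽쳐스 and 픽처스 as the same spelling) is an extra assumption that the notebook does not apply, and whether such pairs are really the same company should be checked by hand.

```python
def normalize_distributor(name: str) -> str:
    """Collapse spelling variants of a distributor name (hypothetical helper)."""
    # Same precedence as the notebook: CGV first, then 쇼박스, then CJ
    if "CGV" in name or "무비꼴라쥬" in name:
        return "CGV"
    if "쇼박스" in name:
        return "쇼박스"
    if "CJ" in name:
        return "CJ엔터테인먼트"
    cleaned = name.replace("(주)", "").replace(" ", "")
    # Assumption not in the notebook: unify the two spellings of "pictures"
    return cleaned.replace("픽쳐스", "픽처스")


examples = ["(주)쇼박스", "CGV 아트하우스", "CJ E&M 영화부문", "(주) 케이알씨지"]
print([normalize_distributor(name) for name in examples])
```

It would be applied as `dataset["distributor"].apply(normalize_distributor)`.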

In [26]:
print("Distributors:", len(dataset["distributor"].unique()))
print(dataset["distributor"].unique())

Distributors: 153
['롯데엔터테인먼트' '쇼박스' 'NEW' '백두대간' '유니버설픽쳐스인터내셔널코리아' '두타연' '케이알씨지' '콘텐츠윙'
 '키노아이' '팝파트너스' 'CJ엔터테인먼트' '영화제작전원사' 'CGV' '리틀빅픽처스' '스폰지' '조이앤시네마' '인디플러그'
 '콘텐츠판다' '인디스토리' '팝엔터테인먼트' '시네마서비스' '웃기씨네' '영화사진진' '레인보우팩토리' '김기덕필름'
 '동국대학교충무로영상제작센터' 'BoXoo엔터테인먼트' '마운틴픽쳐스' '메가박스플러스엠' '골든타이드픽처스' '파이오니아21'
 '디씨드' '드림팩트엔터테인먼트' '시너지' '디마엔터테인먼트' '판다미디어' '스톰픽쳐스코리아' '예지림엔터테인먼트'
 '영화사조제' '보람엔터테인먼트' '시네마달' '노바엔터테인먼트' '패스파인더씨앤씨' '대명문화공장' '온비즈넷'
 'KT&G상상마당' '인벤트디' '씨네그루키다리이엔티' '스튜디오후크' '나이너스엔터테인먼트' 'THE픽쳐스' '영구아트무비'
 '리틀빅픽쳐스' '어뮤즈' '이모션픽처스' '이스트스카이필름' '필라멘트픽쳐스' '조이앤컨텐츠그룹' '타임스토리그룹'
 '마운틴픽처스' '휘엔터테인먼트' '이십세기폭스코리아' '피터팬픽쳐스' '에스와이코마드' '더픽쳐스' '오퍼스픽쳐스' '고앤고필름'
 '사람과사람들' 'JK필름' '씨너스엔터테인먼트' 'KT' '싸이더스FNH' '프레인글로벌' '나우콘텐츠' '홀리가든'
 '브릿지웍스' '엣나인필름' '위더스필름' '에이원엔터테인먼트' 'OAL(올)' '싸이더스' '전망좋은영화사' '스토리셋'
 '이상우필름' '씨네굿필름' '영희야놀자' '찬란' '어썸피플' '아방가르드필름' '스크린조이' '와이드릴리즈' 'tvN'
 '액티버스엔터테인먼트' '더픽쳐스/마운틴픽쳐스' '제나두엔터테인먼트' '아이필름코퍼레이션' '쟈비스미디어' '트리필름' '에스피엠'
 '건시네마' '키노엔터테인먼트' '아우라픽처스' '에이블엔터테인먼트' '드림로드' '인피니티엔터테인먼트' '새인컴퍼니'
 '스튜디오

In [27]:
fig =px.scatter(dataset, x = "distributor", y= "box_off_num")
fig.update_layout(title="Distributor X audience count", template="simple_white")
fig.show()

In [28]:
# Average audience count per distributor
df_distributor = dataset.groupby("distributor")["box_off_num"].mean().reset_index()
df_distributor.sort_values(by="box_off_num", ascending=False, ignore_index=True, inplace=True)
df_distributor.head()

Unnamed: 0,distributor,box_off_num
0,쇼박스,3332954.0
1,아이필름코퍼레이션,3117859.0
2,영구아트무비,2541603.0
3,CJ엔터테인먼트,2209296.0
4,NEW,1939060.0


In [29]:
fig = px.bar(df_distributor, x="distributor", y="box_off_num")
fig.update_layout(title="Distributor X average audience count", template="simple_white")
fig.show()

In [30]:
fig = px.histogram(df_distributor["box_off_num"])
fig.show()

In [31]:
# Keep only distributors whose average audience count exceeds 500,000
df_distributor_over_500k = df_distributor[df_distributor["box_off_num"]>500000]
fig = px.histogram(df_distributor_over_500k["box_off_num"])
fig.update_layout(title="Distribution of average audience count per distributor (over 500,000)", template="simple_white")
fig.show()

In [32]:
# Encode each distributor by its average audience count (0 = largest)
encode_distributor = []
one_million = 1000000
for i in df_distributor.index:
  mean_audience = df_distributor.loc[i, "box_off_num"]
  # more than 1 million
  if mean_audience > one_million:
    if mean_audience > one_million * 3:
      rank = 0
    elif mean_audience > one_million * 2:
      rank = 1
    else:
      rank = 2
  # 1 million or fewer
  else:
    if mean_audience > one_million / 2:
      rank = 3
    elif mean_audience > one_million / 10:
      rank = 4
    else:
      rank = 5
  encode_distributor.append(rank)

In [33]:
df_distributor["encode_distributor"] = encode_distributor

In [34]:
dataset = pd.merge(
    left=dataset, 
    right=df_distributor[["distributor", "encode_distributor"]], 
    on="distributor", 
    how="left"
)

In [35]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=dataset["encode_distributor"], y=dataset["box_off_num"], mode="markers", name="Audience count per film"))
temp = dataset.groupby("encode_distributor")["box_off_num"].mean().reset_index()
fig.add_trace(go.Scatter(x=temp["encode_distributor"], y=temp["box_off_num"], mode="lines", name="Average audience count"))
fig.update_layout(title="Distributor encoding X audience count")
fig.update_xaxes(title="Distributor encoding")
fig.update_yaxes(title="Audience count")
fig.show()

In [36]:
dataset.head()

Unnamed: 0,distributor,genre,release_time,time,screening_rat,director,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor
0,롯데엔터테인먼트,액션,2012-11-22,96,청소년 관람불가,조병옥,0,91,2,23398,6,2
1,쇼박스,느와르,2015-11-19,130,청소년 관람불가,우민호,2,387,3,7072501,4,0
2,쇼박스,액션,2013-06-05,123,15세 관람가,장철수,4,343,4,6959083,9,0
3,NEW,코미디,2012-07-12,101,전체 관람가,구자홍,2,20,6,217866,8,2
4,쇼박스,코미디,2010-11-04,108,15세 관람가,신근호,1,251,2,483387,4,0


## (3) genre variable

In [37]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=dataset["genre"], y=dataset["box_off_num"], mode="markers", name="Audience count"))

genre_mean = dataset.groupby("genre")["box_off_num"].mean().reset_index()
fig.add_trace(go.Scatter(x=genre_mean["genre"], y=genre_mean["box_off_num"], mode="markers", name="Average audience count"))
fig.show()

In [38]:
genre_mean["box_off_num"] = genre_mean["box_off_num"].astype(int)
genre_mean.sort_values(by="box_off_num", ascending=False, ignore_index=True, inplace=True)

In [39]:
genre_mean.head()

Unnamed: 0,genre,box_off_num
0,느와르,2263695
1,액션,2203974
2,SF,1788345
3,코미디,1193914
4,드라마,625689


In [40]:
px.histogram(genre_mean["box_off_num"], nbins=10, title="Distribution of average audience count per genre")

In [41]:
half_million = 500000
print(half_million * 3)

1500000


In [42]:
# Encode each genre by its average audience count (0 = largest)
encode_genre = []
half_million = 500000
for i in genre_mean.index:
  mean_audience = genre_mean.loc[i, "box_off_num"]
  # more than 500,000
  if mean_audience > half_million:
    if mean_audience > half_million * 4:
      rank = 0
    elif mean_audience > half_million * 3:
      rank = 1
    elif mean_audience > half_million * 2:
      rank = 2
    else:
      rank = 3
  # 500,000 or fewer
  else:
    if mean_audience > half_million / 2:
      rank = 4
    else:
      rank = 5
  encode_genre.append(rank)

In [43]:
genre_mean["encode_genre"] = encode_genre

In [44]:
dataset = pd.merge(
    left=dataset,
    right=genre_mean[["genre", "encode_genre"]],
    on="genre",
    how="left"
)

In [45]:
dataset.head()

Unnamed: 0,distributor,genre,release_time,time,screening_rat,director,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre
0,롯데엔터테인먼트,액션,2012-11-22,96,청소년 관람불가,조병옥,0,91,2,23398,6,2,0
1,쇼박스,느와르,2015-11-19,130,청소년 관람불가,우민호,2,387,3,7072501,4,0,0
2,쇼박스,액션,2013-06-05,123,15세 관람가,장철수,4,343,4,6959083,9,0,0
3,NEW,코미디,2012-07-12,101,전체 관람가,구자홍,2,20,6,217866,8,2,2
4,쇼박스,코미디,2010-11-04,108,15세 관람가,신근호,1,251,2,483387,4,0,2


In [46]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=dataset["encode_genre"], y=dataset["box_off_num"], mode="markers", name="Audience count", hovertemplate=dataset["genre"]))

temp = dataset.groupby("encode_genre")["box_off_num"].mean().reset_index()
fig.add_trace(go.Scatter(x=temp["encode_genre"], y=temp["box_off_num"], mode="lines", name="Average audience count"))
fig.show()

## (4) release_time variable

In [47]:
dataset["release_time"] = pd.to_datetime(dataset["release_time"])
dataset["release_time_year"] = dataset["release_time"].dt.year
dataset["release_time_month"] = dataset["release_time"].dt.month

>> A pattern is visible only at the season level

In [48]:
# Months 3-5 and 9-10 -> "spring/fall", 6-8 -> "summer", all remaining months (11, 12, 1, 2) -> "winter"
dataset["encode_season"] = dataset["release_time_month"].apply(lambda x: "spring/fall" if x in (3, 4, 5, 9, 10) else ("summer" if x in (6, 7, 8) else "winter"))
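
The month-to-season rule can also be written as a lookup table, which shows at a glance that November is grouped with winter. A minimal sketch; the name `MONTH_TO_SEASON` is hypothetical and the grouping mirrors the cell above.

```python
import pandas as pd

# Same grouping as the notebook cell: 3-5 and 9-10, 6-8, and the rest
MONTH_TO_SEASON = {
    **{month: "spring/fall" for month in (3, 4, 5, 9, 10)},
    **{month: "summer" for month in (6, 7, 8)},
    **{month: "winter" for month in (11, 12, 1, 2)},
}

months = pd.Series([11, 6, 7, 3, 9, 10, 12, 1, 2])
seasons = months.map(MONTH_TO_SEASON)
print(seasons.tolist())
```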


In [49]:
dataset.head()

Unnamed: 0,distributor,genre,release_time,time,screening_rat,director,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,encode_season
0,롯데엔터테인먼트,액션,2012-11-22,96,청소년 관람불가,조병옥,0,91,2,23398,6,2,0,2012,11,winter
1,쇼박스,느와르,2015-11-19,130,청소년 관람불가,우민호,2,387,3,7072501,4,0,0,2015,11,winter
2,쇼박스,액션,2013-06-05,123,15세 관람가,장철수,4,343,4,6959083,9,0,0,2013,6,summer
3,NEW,코미디,2012-07-12,101,전체 관람가,구자홍,2,20,6,217866,8,2,2,2012,7,summer
4,쇼박스,코미디,2010-11-04,108,15세 관람가,신근호,1,251,2,483387,4,0,2,2010,11,winter


In [50]:
fig = go.Figure()
for season in dataset["encode_season"].unique():
  fig.add_trace(go.Violin(x=dataset["encode_season"][dataset["encode_season"]==season], y=dataset["box_off_num"][dataset["encode_season"]==season], box_visible=True, points="all"))

fig.show()

In [51]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=dataset["encode_season"], y=dataset["box_off_num"], mode="markers", name="Audience count"))
temp = dataset.groupby("encode_season")["box_off_num"].mean().reset_index()
fig.add_trace(go.Scatter(x=temp["encode_season"], y=temp["box_off_num"], mode="lines", name="Average audience count"))
fig.show()

In [52]:
temp

Unnamed: 0,encode_season,box_off_num
0,spring/fall,457376.826087
1,summer,947644.774436
2,winter,903854.931937


In [53]:
onehot_season = pd.get_dummies(dataset["encode_season"])
dataset = pd.concat([dataset, onehot_season], axis=1)

In [54]:
dataset.drop(columns=["encode_season"], inplace=True)

In [55]:
dataset.head()

Unnamed: 0,distributor,genre,release_time,time,screening_rat,director,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter
0,롯데엔터테인먼트,액션,2012-11-22,96,청소년 관람불가,조병옥,0,91,2,23398,6,2,0,2012,11,0,0,1
1,쇼박스,느와르,2015-11-19,130,청소년 관람불가,우민호,2,387,3,7072501,4,0,0,2015,11,0,0,1
2,쇼박스,액션,2013-06-05,123,15세 관람가,장철수,4,343,4,6959083,9,0,0,2013,6,0,1,0
3,NEW,코미디,2012-07-12,101,전체 관람가,구자홍,2,20,6,217866,8,2,2,2012,7,0,1,0
4,쇼박스,코미디,2010-11-04,108,15세 관람가,신근호,1,251,2,483387,4,0,2,2010,11,0,0,1


## (5) screening_rat variable

In [56]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(x=dataset["screening_rat"], y=dataset["box_off_num"], mode="markers")
)

temp = dataset.groupby("screening_rat")["box_off_num"].mean().reset_index()
fig.add_trace(
    go.Scatter(x=temp["screening_rat"], y=temp["box_off_num"], mode="markers")
)

fig.show()

In [57]:
temp

Unnamed: 0,screening_rat,box_off_num
0,12세 관람가,844980.9
1,15세 관람가,1247519.0
2,전체 관람가,135100.5
3,청소년 관람불가,364181.3


In [58]:
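# Rank ratings by average audience count (0 = largest): 15세 관람가, 12세 관람가, 청소년 관람불가, then everything else (전체 관람가)
rating_rank = {"15세 관람가": 0, "12세 관람가": 1, "청소년 관람불가": 2}
dataset["encode_screening_rat"] = dataset["screening_rat"].apply(lambda x: rating_rank.get(x, 3))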

In [59]:
dataset.head()

Unnamed: 0,distributor,genre,release_time,time,screening_rat,director,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat
0,롯데엔터테인먼트,액션,2012-11-22,96,청소년 관람불가,조병옥,0,91,2,23398,6,2,0,2012,11,0,0,1,2
1,쇼박스,느와르,2015-11-19,130,청소년 관람불가,우민호,2,387,3,7072501,4,0,0,2015,11,0,0,1,2
2,쇼박스,액션,2013-06-05,123,15세 관람가,장철수,4,343,4,6959083,9,0,0,2013,6,0,1,0,0
3,NEW,코미디,2012-07-12,101,전체 관람가,구자홍,2,20,6,217866,8,2,2,2012,7,0,1,0,3
4,쇼박스,코미디,2010-11-04,108,15세 관람가,신근호,1,251,2,483387,4,0,2,2010,11,0,0,1,0


---

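## (6) director variable
We considered encoding each director by the number of films they made during the period covered by the dataset,
but the dir_prev_num column already records the number of previous films,
so we did not pursue it.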

In [60]:
px.histogram(dataset["dir_prev_num"])

In [61]:
temp =dataset.groupby(["director"]).count()["dir_prev_num"].reset_index()
px.histogram(temp["dir_prev_num"])

---

## (7) time (running time) variable
The running time is split into ranges and encoded

In [62]:
dataset.head()

Unnamed: 0,distributor,genre,release_time,time,screening_rat,director,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat
0,롯데엔터테인먼트,액션,2012-11-22,96,청소년 관람불가,조병옥,0,91,2,23398,6,2,0,2012,11,0,0,1,2
1,쇼박스,느와르,2015-11-19,130,청소년 관람불가,우민호,2,387,3,7072501,4,0,0,2015,11,0,0,1,2
2,쇼박스,액션,2013-06-05,123,15세 관람가,장철수,4,343,4,6959083,9,0,0,2013,6,0,1,0,0
3,NEW,코미디,2012-07-12,101,전체 관람가,구자홍,2,20,6,217866,8,2,2,2012,7,0,1,0,3
4,쇼박스,코미디,2010-11-04,108,15세 관람가,신근호,1,251,2,483387,4,0,2,2010,11,0,0,1,0


---

In [63]:
px.histogram(dataset["time"])

In [64]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=dataset["time"], y=dataset["box_off_num"], mode="markers"))
fig.show()

In [65]:
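# Bins ranked by average audience count (0 = largest): (120, 140] -> 0, over 140 -> 1, (100, 120] -> 2, 100 or less -> 3
dataset["encode_time"] = dataset["time"].apply(lambda x: 0 if 120 < x <= 140 else (1 if x > 140 else (2 if 100 < x <= 120 else 3)))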
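
The running-time bins can be sketched with `pd.cut` as well. This is an illustration under the same boundaries as the cell above (100, 120, 140 minutes, each upper edge belonging to the lower bin); the helper name `encode_running_time` is hypothetical.

```python
import numpy as np
import pandas as pd


def encode_running_time(minutes: pd.Series) -> pd.Series:
    """Map running time to the codes used above: (120, 140] -> 0, over 140 -> 1,
    (100, 120] -> 2, 100 or less -> 3 (hypothetical helper)."""
    bins = [-np.inf, 100, 120, 140, np.inf]
    labels = [3, 2, 0, 1]
    return pd.cut(minutes, bins=bins, labels=labels).astype(int)


minutes = pd.Series([96, 130, 123, 101, 108, 150, 100, 120, 140])
codes = encode_running_time(minutes)
print(codes.tolist())
```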

In [66]:
dataset.groupby("encode_time")["box_off_num"].mean().reset_index()

Unnamed: 0,encode_time,box_off_num
0,0,3128363.0
1,1,1704565.0
2,2,799288.9
3,3,74920.44


---

In [67]:
dataset.head()

Unnamed: 0,distributor,genre,release_time,time,screening_rat,director,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat,encode_time
0,롯데엔터테인먼트,액션,2012-11-22,96,청소년 관람불가,조병옥,0,91,2,23398,6,2,0,2012,11,0,0,1,2,3
1,쇼박스,느와르,2015-11-19,130,청소년 관람불가,우민호,2,387,3,7072501,4,0,0,2015,11,0,0,1,2,0
2,쇼박스,액션,2013-06-05,123,15세 관람가,장철수,4,343,4,6959083,9,0,0,2013,6,0,1,0,0,0
3,NEW,코미디,2012-07-12,101,전체 관람가,구자홍,2,20,6,217866,8,2,2,2012,7,0,1,0,3,2
4,쇼박스,코미디,2010-11-04,108,15세 관람가,신근호,1,251,2,483387,4,0,2,2010,11,0,0,1,0,2


## 2-3) Adding MIN/MAX/MEAN/STD

In [68]:
# distributor_df = dataset.groupby(['distributor'])['box_off_num'].agg(['mean','min','max','std']).reset_index()
# distributor_df.columns = ['distributor', 'distributor_mean', 'distributor_min', 'distributor_max', 'distributor_std']

# genre_df = dataset.groupby(['genre'])['box_off_num'].agg(['mean','min','max','std']).reset_index()
# genre_df.columns = ['genre', 'genre_mean', 'genre_min', 'genre_max', 'genre_std']

# screening_rat_df = dataset.groupby(['screening_rat'])['box_off_num'].agg(['mean','min','max','std']).reset_index()
# screening_rat_df.columns = ['screening_rat', 'screening_rat_mean', 'screening_rat_min', 'screening_rat_max', 'screening_rat_std']

# dataset = dataset.merge(distributor_df, on="distributor", how= "left").merge(genre_df, on = "genre", how= "left").merge(screening_rat_df, on="screening_rat", how="left")

In [69]:
dataset.drop(['distributor','genre','release_time','screening_rat','director'],axis=1,inplace=True)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 600 entries, 0 to 599
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   time                  600 non-null    int64
 1   dir_prev_num          600 non-null    int64
 2   num_staff             600 non-null    int64
 3   num_actor             600 non-null    int64
 4   box_off_num           600 non-null    int64
 5   title_length          600 non-null    int64
 6   encode_distributor    600 non-null    int64
 7   encode_genre          600 non-null    int64
 8   release_time_year     600 non-null    int64
 9   release_time_month    600 non-null    int64
 10  spring/fall           600 non-null    uint8
 11  summer                600 non-null    uint8
 12  winter                600 non-null    uint8
 13  encode_screening_rat  600 non-null    int64
 14  encode_time           600 non-null    int64
dtypes: int64(12), uint8(3)
memory usage: 62.7 KB


In [70]:
dataset.to_csv("dataset.csv", index=False)

In [71]:
dataset

Unnamed: 0,time,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat,encode_time
0,96,0,91,2,23398,6,2,0,2012,11,0,0,1,2,3
1,130,2,387,3,7072501,4,0,0,2015,11,0,0,1,2,0
2,123,4,343,4,6959083,9,0,0,2013,6,0,1,0,0,0
3,101,2,20,6,217866,8,2,2,2012,7,0,1,0,3,2
4,108,1,251,2,483387,4,0,2,2010,11,0,0,1,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,111,1,510,7,1475091,2,2,3,2014,8,0,1,0,2,2
596,127,1,286,6,1716438,4,0,3,2013,3,1,0,0,0,0
597,99,0,123,4,2475,5,5,3,2010,9,1,0,0,2,3
598,102,0,431,4,2192525,6,1,0,2015,5,1,0,0,0,2


## SPLIT DATA

In [130]:
dataset.head()

Unnamed: 0,time,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat,encode_time
0,96,0,91,2,23398,6,2,0,2012,11,0,0,1,2,3
1,130,2,387,3,7072501,4,0,0,2015,11,0,0,1,2,0
2,123,4,343,4,6959083,9,0,0,2013,6,0,1,0,0,0
3,101,2,20,6,217866,8,2,2,2012,7,0,1,0,3,2
4,108,1,251,2,483387,4,0,2,2010,11,0,0,1,0,2


In [131]:
#split the data set into independent (X) and dependent (Y) data sets
X = dataset[[c for c in dataset.columns if c not in ['box_off_num','release_time_year','release_time_month']]]
y = dataset['box_off_num']

#split the data set into 75% training and 25% testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)

#scale the data (feature scaling)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [84]:
X_train.shape

(450, 12)

In [85]:
X_test.shape

(150, 12)

> #1 Linear Regression

In [76]:
# Model fit
from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(X_train,y_train)

LinearRegression()

In [77]:
#RMSE
import numpy as np
from sklearn.metrics import mean_squared_error
y_pred=lin_reg.predict(X_test)
lin_mse=mean_squared_error(y_test,y_pred)
lin_rmse=np.sqrt(lin_mse)
lin_rmse

1291826.5489492777
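
Every model in this section is scored with the same two-step pattern (MSE, then square root). A minimal sketch of a shared helper, assuming the hypothetical name `rmse` and NumPy only; computing the score in one place avoids accidentally reusing another model's intermediate variable.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equal-length sequences (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values for illustration only: errors of 3 and 4 give sqrt((9 + 16) / 2)
example_score = rmse([0, 0], [3, 4])
print(example_score)
```

With it, each model cell reduces to a single line such as `rmse(y_test, lin_reg.predict(X_test))`.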

> #2 Ridge Regression

In [78]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X_train,y_train)  
y_pred=ridge_reg.predict(X_test)       
rid_mse=mean_squared_error(y_test,y_pred)
rid_rmse=np.sqrt(rid_mse)
rid_rmse                        

1291197.7173211484

> #3 Lasso

In [79]:
from sklearn import linear_model
lasso_reg=linear_model.Lasso(alpha=0.1)
lasso_reg.fit(X_train,y_train)
y_pred=lasso_reg.predict(X_test)
lasso_mse=mean_squared_error(y_test,y_pred)
lasso_rmse=np.sqrt(lasso_mse)
lasso_rmse                        

1291826.4566800839

> #4 Polynomial

In [82]:
from sklearn.preprocessing import PolynomialFeatures
poly_features=PolynomialFeatures(degree=3,include_bias=False)
X_poly=poly_features.fit_transform(X_train)
# expand the test set with the transformer fitted on the training set
X_test_poly=poly_features.transform(X_test)
poly_reg=LinearRegression()
poly_reg.fit(X_poly,y_train)
y_pred=poly_reg.predict(X_test_poly)
poly_mse=mean_squared_error(y_test,y_pred)
poly_rmse=np.sqrt(poly_mse)
poly_rmse

6429671.1788219465

> #5 RandomForestRegressor

In [86]:
from sklearn.ensemble import RandomForestRegressor
rand_reg=RandomForestRegressor(random_state=739).fit(X_train,y_train)
y_pred=rand_reg.predict(X_test)
rand_mse=mean_squared_error(y_test,y_pred)
rand_rmse=np.sqrt(rand_mse)
rand_rmse

1483319.6532859276

> #6 Fine-tune of RandomForestRegressor

In [136]:
from sklearn.model_selection import GridSearchCV
param_grid=[{'n_estimators': [250, 300,400], 'max_features': [1,2,3,5,10]}]

reg=RandomForestRegressor(random_state=79)

grid_search=GridSearchCV(reg,param_grid,cv=5,scoring='neg_mean_squared_error',return_train_score=True)
grid_search.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=79),
             param_grid=[{'max_features': [1, 2, 3, 5, 10],
                          'n_estimators': [250, 300, 400]}],
             return_train_score=True, scoring='neg_mean_squared_error')

In [133]:
grid_search.best_params_

{'max_features': 3, 'n_estimators': 300}

In [134]:
reg=grid_search.best_estimator_

In [135]:
from sklearn.metrics import mean_squared_error
y_pred=reg.predict(X_test)
grid_mse=mean_squared_error(y_test,y_pred)
grid_rmse=np.sqrt(grid_mse)
grid_rmse

1252879.4284884077

> #7 SVM Regression

In [92]:
from sklearn.svm import LinearSVR

svm_reg=LinearSVR(epsilon=1.5).fit(X_train,y_train)


In [93]:
y_pred=svm_reg.predict(X_test)
SVM_mse=mean_squared_error(y_test,y_pred)
SVM_rmse=np.sqrt(SVM_mse)
SVM_rmse

1654302.7391579829

> #GradientBoostingRegressor

In [94]:
from sklearn.model_selection import KFold
kf=KFold(n_splits=10,shuffle=True,random_state=10)

In [95]:
from sklearn.ensemble import GradientBoostingRegressor
gbm=GradientBoostingRegressor(random_state=33).fit(X_train,y_train)


In [96]:
y_pred=gbm.predict(X_test)
gbm_mse=mean_squared_error(y_test,y_pred)
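# NOTE: the value recorded below equals the random-forest RMSE because it was produced with rand_mse; rerun to obtain the GBM score
gbm_rmse=np.sqrt(gbm_mse)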
gbm_rmse

1483319.6532859276

> #LGBMRegressor

In [97]:
from lightgbm import LGBMRegressor
lgbm=LGBMRegressor(random_state=30).fit(X_train,y_train)
y_pred=lgbm.predict(X_test)
lgbm_mse=mean_squared_error(y_test,y_pred)
lgbm_rmse=np.sqrt(lgbm_mse)
lgbm_rmse

1295381.9393187768

In [98]:
from xgboost import XGBRegressor
xgb=XGBRegressor(random_state=23).fit(X_train,y_train)
y_pred=xgb.predict(X_test)
xgb_mse=mean_squared_error(y_test,y_pred)
xgb_rmse=np.sqrt(xgb_mse)
xgb_rmse




1731608.3751323211

In [100]:
dataset.head()

Unnamed: 0,time,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat,encode_time
0,96,0,91,2,23398,6,2,0,2012,11,0,0,1,2,3
1,130,2,387,3,7072501,4,0,0,2015,11,0,0,1,2,0
2,123,4,343,4,6959083,9,0,0,2013,6,0,1,0,0,0
3,101,2,20,6,217866,8,2,2,2012,7,0,1,0,3,2
4,108,1,251,2,483387,4,0,2,2010,11,0,0,1,0,2


In [102]:
dataset_2 = dataset.copy()
# Reverse the code direction so that a larger code means a larger average audience (codes become 1..max+1)
dataset_2["encode_distributor"] = dataset_2["encode_distributor"] + 1
maxi = dataset_2["encode_distributor"].max()
dataset_2["encode_distributor"] = dataset_2["encode_distributor"].apply(lambda x: maxi + 1 - x)

In [103]:
dataset_2.head()

Unnamed: 0,time,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat,encode_time
0,96,0,91,2,23398,6,4,0,2012,11,0,0,1,2,3
1,130,2,387,3,7072501,4,6,0,2015,11,0,0,1,2,0
2,123,4,343,4,6959083,9,6,0,2013,6,0,1,0,0,0
3,101,2,20,6,217866,8,4,2,2012,7,0,1,0,3,2
4,108,1,251,2,483387,4,6,2,2010,11,0,0,1,0,2


In [104]:

dataset_2["encode_genre"] = dataset_2["encode_genre"] + 1
maxi = dataset_2["encode_genre"].max()
dataset_2["encode_genre"] = dataset_2["encode_genre"].apply(lambda x: maxi + 1 - x)

In [105]:

dataset_2["encode_screening_rat"] = dataset_2["encode_screening_rat"] + 1
maxi = dataset_2["encode_screening_rat"].max()
dataset_2["encode_screening_rat"] = dataset_2["encode_screening_rat"].apply(lambda x: maxi + 1 - x)

In [106]:

dataset_2["encode_time"] = dataset_2["encode_time"] + 1
maxi = dataset_2["encode_time"].max()
dataset_2["encode_time"] = dataset_2["encode_time"].apply(lambda x: maxi + 1 - x)
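
The four cells above apply the same reversal to each encoded column: shift by one, then subtract from the shifted maximum plus one. Algebraically this is `max + 1 - code` on the unshifted codes. A minimal sketch, assuming a hypothetical helper `invert_rank`, with the two-step version reproduced next to it for comparison:

```python
import pandas as pd


def invert_rank(codes: pd.Series) -> pd.Series:
    """Reverse rank codes so that 0 (best) becomes the largest value (hypothetical helper)."""
    return codes.max() + 1 - codes


# Illustrative codes in the 0..5 range used for encode_distributor
codes = pd.Series([2, 0, 0, 2, 5])
inverted = invert_rank(codes)

# The notebook's two-step form, for comparison
shifted = codes + 1
maxi = shifted.max()
original_style = shifted.apply(lambda x: maxi + 1 - x)
print(inverted.tolist(), original_style.tolist())
```

A loop such as `for col in [...]: dataset_2[col] = invert_rank(dataset_2[col])` would then replace the repeated cells.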

In [107]:
dataset_2

Unnamed: 0,time,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat,encode_time
0,96,0,91,2,23398,6,4,6,2012,11,0,0,1,2,1
1,130,2,387,3,7072501,4,6,6,2015,11,0,0,1,2,4
2,123,4,343,4,6959083,9,6,6,2013,6,0,1,0,4,4
3,101,2,20,6,217866,8,4,4,2012,7,0,1,0,1,2
4,108,1,251,2,483387,4,6,4,2010,11,0,0,1,4,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,111,1,510,7,1475091,2,4,3,2014,8,0,1,0,2,2
596,127,1,286,6,1716438,4,6,3,2013,3,1,0,0,4,4
597,99,0,123,4,2475,5,1,3,2010,9,1,0,0,2,1
598,102,0,431,4,2192525,6,5,6,2015,5,1,0,0,4,2


In [108]:
dataset

Unnamed: 0,time,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat,encode_time
0,96,0,91,2,23398,6,2,0,2012,11,0,0,1,2,3
1,130,2,387,3,7072501,4,0,0,2015,11,0,0,1,2,0
2,123,4,343,4,6959083,9,0,0,2013,6,0,1,0,0,0
3,101,2,20,6,217866,8,2,2,2012,7,0,1,0,3,2
4,108,1,251,2,483387,4,0,2,2010,11,0,0,1,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,111,1,510,7,1475091,2,2,3,2014,8,0,1,0,2,2
596,127,1,286,6,1716438,4,0,3,2013,3,1,0,0,0,0
597,99,0,123,4,2475,5,5,3,2010,9,1,0,0,2,3
598,102,0,431,4,2192525,6,1,0,2015,5,1,0,0,0,2


In [110]:
dataset_2.corr()

Unnamed: 0,time,dir_prev_num,num_staff,num_actor,box_off_num,title_length,encode_distributor,encode_genre,release_time_year,release_time_month,spring/fall,summer,winter,encode_screening_rat,encode_time
time,1.0,0.306727,0.623205,0.114153,0.441452,-0.235482,0.540888,0.394962,-0.062575,-0.023969,-0.077354,-0.043231,0.12131,0.298326,0.817848
dir_prev_num,0.306727,1.0,0.450706,0.014006,0.259674,-0.076599,0.392454,0.267112,0.132621,0.034019,-0.014028,-0.025779,0.037995,0.232279,0.308107
num_staff,0.623205,0.450706,1.0,0.077871,0.544265,-0.191304,0.685334,0.472244,-0.032891,-0.002841,-0.064019,-0.023994,0.089889,0.324306,0.589856
num_actor,0.114153,0.014006,0.077871,1.0,0.111179,0.018571,0.068556,0.034781,-0.098869,-0.015063,-0.031514,0.068953,-0.027766,0.044571,0.115322
box_off_num,0.441452,0.259674,0.544265,0.111179,1.0,-0.085964,0.524165,0.31495,-0.002497,0.019104,-0.126737,0.069967,0.07321,0.23765,0.520013
title_length,-0.235482,-0.076599,-0.191304,0.018571,-0.085964,1.0,-0.122715,-0.19776,0.097394,0.057453,-0.069736,0.042665,0.036568,-0.167968,-0.191149
encode_distributor,0.540888,0.392454,0.685334,0.068556,0.524165,-0.122715,1.0,0.383996,-0.064042,-0.016479,-0.046694,0.003451,0.046882,0.319854,0.530328
encode_genre,0.394962,0.267112,0.472244,0.034781,0.31495,-0.19776,0.383996,1.0,-0.03075,-0.016704,-0.08455,0.074834,0.023734,0.285549,0.3193
release_time_year,-0.062575,0.132621,-0.032891,-0.098869,-0.002497,0.097394,-0.064042,-0.03075,1.0,0.046627,-0.016548,0.005747,0.012581,-0.04756,-0.017218
release_time_month,-0.023969,0.034019,-0.002841,-0.015063,0.019104,0.057453,-0.016479,-0.016704,0.046627,1.0,-0.163876,0.028922,0.149544,-0.029604,-0.008666


In [111]:
dataset_3 = dataset_2.drop(columns=["time", "dir_prev_num", "encode_genre", "spring/fall", "winter", "summer", "encode_screening_rat"])

In [112]:
#split the data set into independent (X) and dependent (Y) data sets
X = dataset_3[[c for c in dataset_3.columns if c not in ['box_off_num']]]
y = dataset_3['box_off_num']

#split the data set into 75% training and 25% testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)

#scale the data (feature scaling)
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
# sc = StandardScaler()
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [113]:
# Model fit
from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(X_train,y_train)

LinearRegression()

In [114]:
#RMSE
import numpy as np
from sklearn.metrics import mean_squared_error
y_pred=lin_reg.predict(X_test)
lin_mse=mean_squared_error(y_test,y_pred)
lin_rmse=np.sqrt(lin_mse)
lin_rmse

1311745.7301215145
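The same three lines (MSE, square root, display) are repeated for every model in this section. A small helper could remove that repetition. The sketch below is only an illustration: the function name `rmse` and the toy arrays are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Toy values (hypothetical) to show the call; in the notebook this would be
# rmse(y_test, lin_reg.predict(X_test)).
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 330.0])
example_rmse = rmse(y_true_demo, y_pred_demo)
print(example_rmse)  # about 19.15
```

Because RMSE is expressed in the unit of `box_off_num`, a value around 1.3 million means the typical error is on the order of 1.3 million viewers.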

In [115]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X_train,y_train)  
y_pred=ridge_reg.predict(X_test)       
rid_mse=mean_squared_error(y_test,y_pred)
rid_rmse=np.sqrt(rid_mse)
rid_rmse                        

1295711.4096162063

In [116]:
from sklearn import linear_model
lasso_reg=linear_model.Lasso(alpha=0.1)
lasso_reg.fit(X_train,y_train)
y_pred=lasso_reg.predict(X_test)
lasso_mse=mean_squared_error(y_test,y_pred)
lasso_rmse=np.sqrt(lasso_mse)
lasso_rmse                        

1311745.1707194233
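The Lasso RMSE is almost identical to the plain linear regression RMSE. One plausible reason is that `alpha=0.1` is very small relative to a target measured in millions, so the L1 penalty hardly changes the fit. A hedged sketch of searching over a wider, log-spaced range of `alpha` values with `LassoCV` follows. The synthetic data, the alpha range, and the `_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the scaled training data (assumption: 7 features)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=7, n_informative=3, noise=20.0, random_state=2
)
y_demo = y_demo * 1e4  # mimic a target on a large scale

# Log-spaced candidates so that both weak and strong penalties are tried
alphas_demo = np.logspace(-1, 5, 13)
lasso_cv = LassoCV(alphas=alphas_demo, cv=5, max_iter=10000).fit(X_demo, y_demo)

chosen_alpha = lasso_cv.alpha_
n_zero_coefs = int(np.sum(lasso_cv.coef_ == 0))
print(chosen_alpha, n_zero_coefs)
```

In the notebook, `X_train` and `y_train` would replace the synthetic arrays; whether a larger `alpha` actually lowers the test RMSE there is not something this sketch can tell.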

In [117]:
# from sklearn.preprocessing import PolynomialFeatures
# poly_features=PolynomialFeatures(degree=3,include_bias=False)
# X_poly=poly_features.fit_transform(X_train)
# X_test_poly=poly_features.transform(X_test)  # transform only, and keep X_test intact for later cells
# poly_reg=LinearRegression()
# poly_reg.fit(X_poly,y_train)
# y_pred=poly_reg.predict(X_test_poly)
# poly_mse=mean_squared_error(y_test,y_pred)
# poly_rmse=np.sqrt(poly_mse)
# poly_rmse

In [118]:
from sklearn.ensemble import RandomForestRegressor
rand_reg=RandomForestRegressor(random_state=739).fit(X_train,y_train)
y_pred=rand_reg.predict(X_test)
rand_mse=mean_squared_error(y_test,y_pred)
rand_rmse=np.sqrt(rand_mse)
rand_rmse

1522908.524517347

In [119]:
from sklearn.model_selection import GridSearchCV
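# Note: an integer max_features larger than the number of columns in X_train cannot be fitted,
# which likely explains the failed fits reported in this cell's output for 10 and 20
param_grid=[{'n_estimators': [3, 100, 200], 'max_features': [5,10,20]}]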

reg=RandomForestRegressor(random_state=79)

grid_search=GridSearchCV(reg,param_grid,cv=5,scoring='neg_mean_squared_error',return_train_score=True)
grid_search.fit(X_train,y_train)



30 fits failed out of a total of 45.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_forest.py", line 467, in fit
    for i, t in enumerate(trees)
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/usr/local/lib/python3.7/dist-packages/joblib/parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python3.7

GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=79),
             param_grid=[{'max_features': [5, 10, 20],
                          'n_estimators': [3, 100, 200]}],
             return_train_score=True, scoring='neg_mean_squared_error')
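The warning above reports 30 failed fits out of 45, which matches two of the three `max_features` candidates failing across 3 `n_estimators` values and 5 folds. The traceback is truncated, so the exact error is not visible, but a likely cause is that 10 and 20 exceed the number of feature columns. One way to avoid this is to build the grid from the data shape. The sketch below uses synthetic data with 7 features as an assumption; all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption: 7 feature columns)
X_demo, y_demo = make_regression(
    n_samples=120, n_features=7, noise=10.0, random_state=0
)

n_features = X_demo.shape[1]
# Keep only the candidates that do not exceed the number of columns
max_features_grid = [m for m in (3, 5, 10, 20) if m <= n_features]
param_grid_demo = [{"n_estimators": [10, 30], "max_features": max_features_grid}]

search = GridSearchCV(
    RandomForestRegressor(random_state=79),
    param_grid_demo,
    cv=3,
    scoring="neg_mean_squared_error",
    error_score="raise",  # surface any failing fit instead of scoring it as nan
)
search.fit(X_demo, y_demo)

# best_score_ is a negated MSE, so flip the sign before taking the root
best_cv_rmse = float(np.sqrt(-search.best_score_))
print(search.best_params_, best_cv_rmse)
```

Setting `error_score="raise"` follows the suggestion printed in the warning and makes a misconfigured grid fail immediately rather than silently dropping candidates.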

In [120]:
grid_search.best_params_

{'max_features': 5, 'n_estimators': 100}

In [121]:
reg=grid_search.best_estimator_

In [122]:
from sklearn.metrics import mean_squared_error
y_pred=reg.predict(X_test)
grid_mse=mean_squared_error(y_test,y_pred)
grid_rmse=np.sqrt(grid_mse)
grid_rmse

1593902.449856775

In [123]:
from sklearn.svm import LinearSVR

svm_reg=LinearSVR(epsilon=1.5).fit(X_train,y_train)


In [124]:
y_pred=svm_reg.predict(X_test)
SVM_mse=mean_squared_error(y_test,y_pred)
SVM_rmse=np.sqrt(SVM_mse)
SVM_rmse

1654287.2917367523

In [125]:
from sklearn.model_selection import KFold
kf=KFold(n_splits=10,shuffle=True,random_state=10)
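The `kf` splitter is defined here but is not passed to any model in the cells shown in this section. A minimal sketch of how such a splitter could be used with `cross_val_score` is given below. It wraps the scaler and the model in a pipeline so that scaling is fitted inside each fold. The synthetic data, the choice of `Ridge`, and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the unscaled features and target (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=7, noise=15.0, random_state=1
)

kf_demo = KFold(n_splits=10, shuffle=True, random_state=10)

# The scaler is re-fitted on the training part of every fold
model_demo = make_pipeline(MinMaxScaler(), Ridge(alpha=1))

neg_mse = cross_val_score(
    model_demo, X_demo, y_demo, cv=kf_demo, scoring="neg_mean_squared_error"
)
fold_rmse = np.sqrt(-neg_mse)
print(fold_rmse.mean(), fold_rmse.std())
```

With only 600 rows, the mean and spread over 10 folds give a steadier picture than the single 75/25 split used for the scores in this section.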

In [126]:
from sklearn.ensemble import GradientBoostingRegressor
gbm=GradientBoostingRegressor(random_state=33).fit(X_train,y_train)


In [127]:
y_pred=gbm.predict(X_test)
gbm_mse=mean_squared_error(y_test,y_pred)
gbm_rmse=np.sqrt(gbm_mse)  # take the root of gbm_mse (not rand_mse) so the score belongs to this model
gbm_rmse

In [128]:
from lightgbm import LGBMRegressor
lgbm=LGBMRegressor(random_state=30).fit(X_train,y_train)
y_pred=lgbm.predict(X_test)
lgbm_mse=mean_squared_error(y_test,y_pred)
lgbm_rmse=np.sqrt(lgbm_mse)
lgbm_rmse

1266974.5150257403

In [129]:
from xgboost import XGBRegressor
xgb=XGBRegressor(random_state=23).fit(X_train,y_train)
y_pred=xgb.predict(X_test)
xgb_mse=mean_squared_error(y_test,y_pred)
xgb_rmse=np.sqrt(xgb_mse)
xgb_rmse

1692204.6784924509
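To compare the models at a glance, the RMSE values printed in the cells above can be collected into one sorted table. The sketch below copies the numbers from the outputs in this section. Gradient boosting is omitted because this section does not contain an RMSE computed from `gbm_mse`. The labels and the `summary` name are assumptions.

```python
import pandas as pd

# RMSE values copied from the cell outputs in this section
rmse_by_model = {
    "LinearRegression": 1311745.7301215145,
    "Ridge": 1295711.4096162063,
    "Lasso": 1311745.1707194233,
    "RandomForest": 1522908.524517347,
    "RandomForest (grid search)": 1593902.449856775,
    "LinearSVR": 1654287.2917367523,
    "LGBM": 1266974.5150257403,
    "XGB": 1692204.6784924509,
}

# Sort ascending so the lowest test RMSE comes first
summary = pd.Series(rmse_by_model, name="rmse").sort_values()
print(summary.round(0))
```

On this single split, LGBM has the lowest RMSE and XGB the highest, and the three linear models sit between LGBM and the other tree-based models. These are scores from one 75/25 split, so the ranking may change under cross-validation.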