# Reducing the Memory Footprint of the movie Data

Ways to reduce memory usage:
1. Buy more memory or use the cloud
2. Split the file into smaller pieces
3. Build a structure that consumes less memory
4. Use a disk-backed structure (dask)
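
As a rough illustration of option 2, pandas can read a CSV in chunks so that only one piece is held in memory at a time. This is a minimal sketch that uses a small in-memory CSV with made-up values in place of movie.csv; the chunk size of 2 is arbitrary.

```python
import io
import pandas as pd

# Small stand-in for movie.csv (made-up values, for illustration only).
csv_text = "movie_facebook_likes\n10\n20\n30\n40\n50\n"

total = 0
n_chunks = 0
# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += int(chunk["movie_facebook_likes"].sum())
    n_chunks += 1

print(n_chunks, total)
```

Aggregations that can be accumulated piece by piece (sums, counts, min/max) suit this pattern; operations that need all rows at once do not.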

Here we focus on making the data consume less memory.

In [13]:
import pandas as pd
import numpy as np

In [14]:
# Load the movie data (first 1,000 rows only)
movies = pd.read_csv('./data/data/movie.csv', nrows = 1000)
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      997 non-null    object 
 1   director_name              982 non-null    object 
 2   num_critic_for_reviews     998 non-null    float64
 3   duration                   997 non-null    float64
 4   director_facebook_likes    982 non-null    float64
 5   actor_3_facebook_likes     998 non-null    float64
 6   actor_2_name               999 non-null    object 
 7   actor_1_facebook_likes     1000 non-null   float64
 8   gross                      965 non-null    float64
 9   genres                     1000 non-null   object 
 10  actor_1_name               1000 non-null   object 
 11  movie_title                1000 non-null   object 
 12  num_voted_users            1000 non-null   int64  
 13  cast_total_facebook_likes  1000 non-null   int64 

The DataFrame currently uses 218.9+ KB of memory and contains 13 float64 columns, 3 int64 columns, and 12 object columns.
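
The `+` in `218.9+ KB` marks a lower bound: for object columns, `info()` counts only the 8-byte references, not the Python strings they point to. Passing `memory_usage='deep'` to `info()`, or `deep=True` to `memory_usage()`, includes the string contents. A small sketch with made-up data comparing the two measurements:

```python
import pandas as pd

# Made-up frame with one object column and one numeric column.
df = pd.DataFrame({
    "name": pd.Series(["Johnny Depp", "Tom Hardy", "Doug Walker"] * 100, dtype=object),
    "likes": [40000.0, 27000.0, 131.0] * 100,
})

shallow = df.memory_usage(deep=False)  # object column: reference size only
deep = df.memory_usage(deep=True)      # object column: includes string contents

print("name :", shallow["name"], deep["name"])
print("likes:", shallow["likes"], deep["likes"])
```

The numeric column reports the same size either way; only the object column grows under the deep measurement.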

First, let's reduce the size by loading only the columns we actually want to use.

In [15]:
# Use only the actor names (object), the actor Facebook likes (float64), and the movie Facebook likes (int64).

cols = ['actor_1_name','actor_2_name','actor_3_name','actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes',
        'movie_facebook_likes']

movies2 = pd.read_csv('./data/data/movie.csv', nrows=1000,
                       usecols=cols)

movies2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   actor_3_facebook_likes  998 non-null    float64
 1   actor_2_name            999 non-null    object 
 2   actor_1_facebook_likes  1000 non-null   float64
 3   actor_1_name            1000 non-null   object 
 4   actor_3_name            998 non-null    object 
 5   actor_2_facebook_likes  999 non-null    float64
 6   movie_facebook_likes    1000 non-null   int64  
dtypes: float64(3), int64(1), object(3)
memory usage: 54.8+ KB


Memory usage is now down to 54.8+ KB.
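
Note that the columns in the `info()` output follow the file's order, not the order of `cols`: `usecols` only selects columns and ignores element order. A minimal sketch with a made-up three-column CSV:

```python
import io
import pandas as pd

# Made-up CSV standing in for movie.csv.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# Request columns in a different order than they appear in the file.
df_file_order = pd.read_csv(io.StringIO(csv_text), usecols=["c", "a"])
print(list(df_file_order.columns))

# To get a specific order, select the columns again after loading.
df_reordered = df_file_order[["c", "a"]]
print(list(df_reordered.columns))
```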

Next, let's look at the float and int columns and shrink them to appropriately sized types.

In [16]:
movies2.head()

|   | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | actor_1_name | actor_3_name | actor_2_facebook_likes | movie_facebook_likes |
|---|---|---|---|---|---|---|---|
| 0 | 855.0 | Joel David Moore | 1000.0 | CCH Pounder | Wes Studi | 936.0 | 33000 |
| 1 | 1000.0 | Orlando Bloom | 40000.0 | Johnny Depp | Jack Davenport | 5000.0 | 0 |
| 2 | 161.0 | Rory Kinnear | 11000.0 | Christoph Waltz | Stephanie Sigman | 393.0 | 85000 |
| 3 | 23000.0 | Christian Bale | 27000.0 | Tom Hardy | Joseph Gordon-Levitt | 23000.0 | 164000 |
| 4 | NaN | Rob Walker | 131.0 | Doug Walker | NaN | 12.0 | 0 |


In [21]:
movies2.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| actor_3_facebook_likes | 998.0 | 1298.910822 | 2845.384663 | 0.0 | 307.25 | 562.0 | 854.0 | 23000.0 |
| actor_1_facebook_likes | 1000.0 | 10666.077 | 11241.500791 | 2.0 | 980.5 | 10000.0 | 15000.0 | 87000.0 |
| actor_2_facebook_likes | 999.0 | 3268.671672 | 5171.729257 | 0.0 | 555.5 | 871.0 | 3000.0 | 27000.0 |
| movie_facebook_likes | 1000.0 | 15931.707 | 31231.750349 | 0.0 | 0.0 | 606.0 | 19000.0 | 349000.0 |


In [18]:
# float16 looks sufficient at first glance, but its max (about 65,504) is below the 87,000 max of actor_1_facebook_likes.
np.finfo('float16')

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)
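
The limits printed by `np.finfo` are worth checking against the column maxima from `describe()`. The sketch below casts a few made-up like counts to float16 to show what happens to values outside its range or beyond its integer precision.

```python
import numpy as np

f16_max = float(np.finfo(np.float16).max)

# Suppress the overflow warning NumPy may emit for the out-of-range cast.
with np.errstate(over="ignore"):
    likes = np.array([87000.0, 2049.0, 1000.0]).astype(np.float16)

print(f16_max)               # largest finite float16 value
print(np.isinf(likes[0]))    # 87000 lies beyond the float16 range
print(float(likes[1]))       # 2049 is not representable and gets rounded
print(float(likes[2]))       # 1000 is stored exactly
```

float16 has an 11-bit significand, so not every integer above 2,048 can be stored exactly. For like counts in the tens of thousands, float32 is the safer choice.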

In [20]:
# The maximum movie likes exceed 340,000, so int32 looks more appropriate than int16.
np.iinfo('int32')

iinfo(min=-2147483648, max=2147483647, dtype=int32)
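
The same check can be automated. The helper below, `smallest_int_dtype`, is a hypothetical function written for this note (not part of NumPy or pandas); it walks the signed integer types and returns the first one whose `np.iinfo` range covers the data.

```python
import numpy as np

def smallest_int_dtype(lo, hi):
    """Return the narrowest signed NumPy integer type that can hold [lo, hi]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    raise ValueError("range does not fit in int64")

# Min and max of movie_facebook_likes, taken from the describe() output.
chosen = smallest_int_dtype(0, 349000)
print(chosen.__name__)
```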

In [23]:
movies3 = pd.read_csv('./data/data/movie.csv', nrows=1000,
                      dtype={
                          'actor_1_facebook_likes': np.float16,
                          'actor_2_facebook_likes': np.float16,
                          'actor_3_facebook_likes': np.float16,
                          'movie_facebook_likes': np.int32
                      },
                      usecols=cols)
movies3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   actor_3_facebook_likes  998 non-null    float16
 1   actor_2_name            999 non-null    object 
 2   actor_1_facebook_likes  1000 non-null   float16
 3   actor_1_name            1000 non-null   object 
 4   actor_3_name            998 non-null    object 
 5   actor_2_facebook_likes  999 non-null    float16
 6   movie_facebook_likes    1000 non-null   int32  
dtypes: float16(3), int32(1), object(3)
memory usage: 33.3+ KB


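This brings memory usage down to 33.3+ KB. One caveat: float16 cannot represent values above 65,504, so the 87,000 maximum of actor_1_facebook_likes overflows to inf, and many integers above 2,048 are rounded. float32 is the safer choice for these columns, at the cost of a few more KB.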
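
As an alternative to choosing dtypes by hand, `pd.to_numeric` with the `downcast` argument can pick a smaller type after loading. For floats it never goes below float32, which sidesteps float16's limited range (max 65,504). A sketch on made-up values resembling the likes columns:

```python
import numpy as np
import pandas as pd

# Made-up values resembling the likes columns (NaN marks a missing value).
actor_likes = pd.Series([855.0, 87000.0, np.nan])
movie_likes = pd.Series([33000, 0, 349000])

# downcast="float" stops at float32; downcast="integer" picks the
# smallest signed integer type that holds every value.
actor_small = pd.to_numeric(actor_likes, downcast="float")
movie_small = pd.to_numeric(movie_likes, downcast="integer")

print(actor_small.dtype, movie_small.dtype)
```

This requires the column to be loaded at full width first, so it helps with the memory held afterwards rather than with peak usage during `read_csv`.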

In [24]:
movies3.describe(include='O')

|   | actor_2_name | actor_1_name | actor_3_name |
|---|---|---|---|
| count | 999 | 1000 | 998 |
| unique | 695 | 427 | 823 |
| top | Morgan Freeman | Will Smith | Steve Coogan |
| freq | 11 | 17 | 7 |


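With 427 to 823 unique values among roughly 1,000 rows, the cardinality is too high for converting these object columns to category to look worthwhile.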
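
To see why cardinality matters, the sketch below compares object and category storage for two made-up columns of 1,000 rows: one with only 2 distinct values and one where every value is distinct. The column contents are assumptions for illustration, not taken from movie.csv.

```python
import pandas as pd

n = 1000
# Made-up columns: one with 2 distinct values, one with all-distinct values.
low_card = pd.Series(["Color", "Black and White"] * (n // 2), dtype=object)
high_card = pd.Series([f"Actor {i}" for i in range(n)], dtype=object)

# deep=True counts the string contents, not just the references.
low_obj = low_card.memory_usage(deep=True)
low_cat = low_card.astype("category").memory_usage(deep=True)
high_obj = high_card.memory_usage(deep=True)
high_cat = high_card.astype("category").memory_usage(deep=True)

print("low cardinality :", low_obj, "->", low_cat)
print("high cardinality:", high_obj, "->", high_cat)
```

A category column stores each distinct string once plus a small integer code per row, so the savings shrink as the share of unique values grows.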