# How to change the data type - use `astype()`
___
The syntax to change the data type is `column.astype('changed_type')`

<u>`astype` returns a new column and leaves the original untouched, so to keep the new data type you need to overwrite the column with the result.</u>

`column = column.astype('changed_type')`

```python
movies.Film = movies.Film.astype('category')

movies.Genre = movies.Genre.astype('category')

movies.Year = movies.Year.astype('category')
```    
__In R, categorical variables are called factors.__
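
The overwrite rule above follows from `astype` returning a new object instead of modifying the column in place. A minimal sketch of that behaviour, using a small made-up `demo` DataFrame (an assumption for illustration, not the `movies` data set):

```python
import pandas as pd

# Hypothetical three-row frame used only for illustration
demo = pd.DataFrame({"Genre": ["Comedy", "Action", "Comedy"]})

# Calling astype without assigning the result leaves the column unchanged
demo.Genre.astype("category")
dtype_without_assignment = str(demo.Genre.dtype)

# Assigning the result back overwrites the column with the converted version
demo.Genre = demo.Genre.astype("category")
dtype_after_assignment = str(demo.Genre.dtype)

print(dtype_without_assignment)  # still the original (non-category) dtype
print(dtype_after_assignment)    # category
```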
       

# Look at the categories in a column
___
    
## Method 1 - use `unique()`

```python
movies.Year.unique() 
#output
[2009, 2008, 2010, 2007, 2011]
Categories (5, int64): [2007, 2008, 2009, 2010, 2011]
``` 
## Method 2 - use `column.cat.categories`

```python
movies.Genre.cat.categories 
#output
Index(['Action', 'Adventure', 'Comedy', 'Drama', 'Horror', 'Romance',
       'Thriller'],
      dtype='object')
```    
**In R, these unique values are called levels**  
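
The two methods do not always give the same answer. As a rough illustration on a made-up Series (an assumption, not the `movies` data), `unique()` lists the values actually present in order of first appearance, while `.cat.categories` lists every category the column knows about, including ones that no longer occur after filtering:

```python
import pandas as pd

# Hypothetical genre column used only for illustration
genre = pd.Series(["Comedy", "Adventure", "Action", "Comedy"]).astype("category")

# Values present in the column, in order of first appearance
present = list(genre.unique())

# All categories; for strings, astype('category') stores them sorted
categories = list(genre.cat.categories)

# Filtering rows out does not remove the category itself
no_action = genre[genre != "Action"]
present_after_filter = list(no_action.unique())
categories_after_filter = list(no_action.cat.categories)

print(present)                  # ['Comedy', 'Adventure', 'Action']
print(categories)               # ['Action', 'Adventure', 'Comedy']
print(present_after_filter)     # ['Comedy', 'Adventure']
print(categories_after_filter)  # ['Action', 'Adventure', 'Comedy']
```

If the leftover category is unwanted, `no_action.cat.remove_unused_categories()` returns a copy without it.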
___    

In [2]:
import pandas as pd #so that we can import our data set
import os #so that we can change our working directory if needed

In [3]:
#check our current working directory
os.getcwd()

'/Users/rajanbawa/Documents/Python'

In [9]:
#import our data set
movies = pd.read_csv('movies.csv')

In [10]:
#exploring the data
movies.info() 
#we have 559 entries and 6 columns, of which 4 are integer (int64) types and 2 are character (object) types.


#Things we noticed
#Year is an int type. It would be better to have it as a categorical type.

#Column names have spaces and special characters (% and $). These need to be removed.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 559 entries, 0 to 558
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Film                       559 non-null    object
 1   Genre                      559 non-null    object
 2   Rotten Tomatoes Ratings %  559 non-null    int64 
 3   Audience Ratings %         559 non-null    int64 
 4   Budget (million $)         559 non-null    int64 
 5   Year of release            559 non-null    int64 
dtypes: int64(4), object(2)
memory usage: 26.3+ KB


In [15]:
#changing the column names - choose names that are easy to type and help you identify the columns
movies.columns = ['Film', 'Genre', 'CriticRating', 'AudienceRating',
       'BudgetMillions', 'Year']
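
Assigning a list to `movies.columns` relies on the new names being in exactly the same order as the existing columns. A possible alternative, sketched here on an empty made-up `demo` frame that only reuses the original column names from `movies.info()`, is `rename` with an explicit old-to-new mapping:

```python
import pandas as pd

# Empty illustrative frame carrying the original column names
demo = pd.DataFrame(columns=[
    "Film", "Genre", "Rotten Tomatoes Ratings %",
    "Audience Ratings %", "Budget (million $)", "Year of release",
])

# Map each old name to its new name; columns not listed keep their names
demo = demo.rename(columns={
    "Rotten Tomatoes Ratings %": "CriticRating",
    "Audience Ratings %": "AudienceRating",
    "Budget (million $)": "BudgetMillions",
    "Year of release": "Year",
})

print(list(demo.columns))
# ['Film', 'Genre', 'CriticRating', 'AudienceRating', 'BudgetMillions', 'Year']
```

The mapping form is longer to type, but a reordered or extra column in the CSV cannot silently shift the names.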


In [19]:
movies.head()

| | Film | Genre | CriticRating | AudienceRating | BudgetMillions | Year |
|---|---|---|---|---|---|---|
| 0 | (500) Days of Summer | Comedy | 87 | 81 | 8 | 2009 |
| 1 | 10,000 B.C. | Adventure | 9 | 44 | 105 | 2008 |
| 2 | 12 Rounds | Action | 30 | 52 | 20 | 2009 |
| 3 | 127 Hours | Adventure | 93 | 84 | 18 | 2010 |
| 4 | 17 Again | Comedy | 55 | 70 | 20 | 2009 |


In [22]:
#the 2nd problem: Year is an int type. It would be better to have it as a categorical type.

#Genre, which is an object (character) type, should also be a categorical variable.



In [25]:
#Film is an object type. Change it to category.
#Do the same for Genre and Year.

movies.Film = movies.Film.astype('category')

movies.Genre = movies.Genre.astype('category')

movies.Year = movies.Year.astype('category')

#check the data types after the conversion
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 559 entries, 0 to 558
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Film            559 non-null    category
 1   Genre           559 non-null    category
 2   CriticRating    559 non-null    int64   
 3   AudienceRating  559 non-null    int64   
 4   BudgetMillions  559 non-null    int64   
 5   Year            559 non-null    category
dtypes: category(3), int64(3)
memory usage: 36.5 KB
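
The three conversions can also be written as a single call, since `DataFrame.astype` accepts a dictionary mapping column names to types. A small sketch on a made-up `demo` frame (three rows borrowed from the `head()` output, with only some of the columns):

```python
import pandas as pd

# Illustrative subset of the movies data, not the full data set
demo = pd.DataFrame({
    "Film": ["(500) Days of Summer", "10,000 B.C.", "12 Rounds"],
    "Genre": ["Comedy", "Adventure", "Action"],
    "CriticRating": [87, 9, 30],
    "Year": [2009, 2008, 2009],
})

# Convert several columns in one call and assign the result back;
# columns not named in the dictionary keep their current type
demo = demo.astype({"Film": "category", "Genre": "category", "Year": "category"})

print(demo.dtypes)
```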


In [26]:
#look at the categories in the Year column
movies.Year.unique()

[2009, 2008, 2010, 2007, 2011]
Categories (5, int64): [2007, 2008, 2009, 2010, 2011]

In [27]:
movies.Genre.unique()

['Comedy', 'Adventure', 'Action', 'Horror', 'Drama', 'Romance', 'Thriller']
Categories (7, object): ['Action', 'Adventure', 'Comedy', 'Drama', 'Horror', 'Romance', 'Thriller']

In [28]:
movies.Film.unique()

['(500) Days of Summer ', '10,000 B.C.', '12 Rounds ', '127 Hours', '17 Again ', ..., 'Your Highness', 'Youth in Revolt', 'Zodiac', 'Zombieland ', 'Zookeeper']
Length: 559
Categories (559, object): ['(500) Days of Summer ', '10,000 B.C.', '12 Rounds ', '127 Hours', ..., 'Youth in Revolt', 'Zodiac', 'Zombieland ', 'Zookeeper']

In [29]:
movies.Genre.cat.categories 

Index(['Action', 'Adventure', 'Comedy', 'Drama', 'Horror', 'Romance',
       'Thriller'],
      dtype='object')

In [30]:
movies.describe() #Year is no longer an int column, so it is left out of the numeric summary.

| | CriticRating | AudienceRating | BudgetMillions |
|---|---|---|---|
| count | 559.0 | 559.0 | 559.0 |
| mean | 47.309481 | 58.744186 | 50.236136 |
| std | 26.413091 | 16.826887 | 48.731817 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 25.0 | 47.0 | 20.0 |
| 50% | 46.0 | 58.0 | 35.0 |
| 75% | 70.0 | 72.0 | 65.0 |
| max | 97.0 | 96.0 | 300.0 |
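
For a frame that mixes numeric and categorical columns, `describe()` summarizes only the numeric ones by default, which is why `Year` dropped out. To summarize the categorical columns instead, the `include` parameter can be used. A sketch on a made-up `demo` frame (the rows are illustrative, not taken from `movies.csv`):

```python
import pandas as pd

# Hypothetical rows used only for illustration
demo = pd.DataFrame({
    "Genre": ["Comedy", "Action", "Comedy"],
    "Year": [2009, 2009, 2010],
    "CriticRating": [87, 30, 55],
}).astype({"Genre": "category", "Year": "category"})

# Default summary: numeric columns only
numeric_summary = demo.describe()

# Summary restricted to the categorical columns:
# count, number of unique values, most frequent value and its frequency
categorical_summary = demo.describe(include=["category"])

print(list(numeric_summary.columns))  # ['CriticRating']
print(categorical_summary)
```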
