**DATA WRANGLING TUTORIAL FOR BEGINNERS**

Introduction:

*This is a basic tutorial for beginners who need a guide to cleaning data before manipulating it. Although many tools and reference sites already exist, I wrote this tutorial for my own use to make the path easier. If you find it useful, feel free to consult it, and please let me know if something goes wrong.*

In [None]:
# First, check what kind of file you will explore.
# Both Excel and CSV files can be read with the pandas library.
# pandas is an open-source library for data manipulation and analysis in Python.

In [None]:
import pandas as pd

In [None]:
# If the file is an Excel workbook, follow the example below:
# df = pd.read_excel('file')
# In this case, I'm using a CSV file, as follows:

In [None]:
netflix = pd.read_csv('netflix_titles.csv')

In [None]:
# Once the file has been imported, it is important to get an overview of what it contains and what its dimensions are.

In [None]:
# head() shows the dataframe columns and the first 5 rows:
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [None]:
# tail() shows the dataframe columns and the last 5 rows:
netflix.tail()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


In [None]:
# Checking the dataframe size with the .shape attribute:

netflix.shape

(8807, 12)
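
The overview steps above can be tried on any small dataframe. The sketch below uses an invented seven-row table (not the Netflix file) only to show that `head()` and `tail()` accept a row count and that `shape` is an attribute rather than a method.

```python
import pandas as pd

# Tiny stand-in table; the values are invented for illustration only.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "G"],
    "release_year": [2015, 2016, 2017, 2018, 2019, 2020, 2021],
})

n_rows, n_cols = sample.shape   # shape is an attribute, so no parentheses
first_two = sample.head(2)      # head(n) returns the first n rows
last_three = sample.tail(3)     # tail(n) returns the last n rows

print(n_rows, n_cols)
print(first_two)
print(last_three)
```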

In [None]:
# To get a better overview of the data, we can use the info() function.
# Its output shows the number of rows and columns, the data type of each column, and how many non-null values each column has.
# From here, it is easy to see that the file needs cleaning before it can be explored.
# As we can see, there are 12 columns and 8807 rows, and some columns contain null values.
# It is also important to check which data types make up the dataframe, such as int64, float64, bool, datetime64 or object (the type pandas uses for text here).
# To manipulate data, it is essential to know its type and to convert it when necessary.

netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [None]:
# We'll start with the .columns attribute, which shows the column names so we can check what kind of data we're working with.
netflix.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [None]:
# It can be transformed to a list
netflix.columns.tolist()

['show_id',
 'type',
 'title',
 'director',
 'cast',
 'country',
 'date_added',
 'release_year',
 'rating',
 'duration',
 'listed_in',
 'description']

In [None]:
# We can rename columns as required.
# Tip: avoid accents and capital letters in column names so they are easier to read and type.
# Columns that don't need a new name can simply be left out of the mapping.

netflix.rename(columns={'show_id':'id', 'listed_in': 'listed', 'description': 'descrip'}, inplace=True)
netflix.columns

Index(['id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed', 'descrip'],
      dtype='object')
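
As a complement to the renaming tip, here is a minimal sketch of normalizing all column names at once. The column names `Show ID` and `Listed In` are made up for this example; the real file already uses lowercase names.

```python
import pandas as pd

# Hypothetical dataframe with untidy column names (invented for this example).
sample = pd.DataFrame({"Show ID": ["s1"], "Listed In": ["Dramas"]})

# Normalize every name: trim, lowercase, replace spaces with underscores.
sample.columns = sample.columns.str.strip().str.lower().str.replace(" ", "_")

# rename() without inplace returns a new dataframe and leaves 'sample' untouched.
renamed = sample.rename(columns={"show_id": "id", "listed_in": "listed"})

print(sample.columns.tolist())
print(renamed.columns.tolist())
```

Assigning the result of `rename()` to a new variable is an alternative to `inplace=True` when the original names should stay available.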

In [None]:
# To check the number of rows, we can use the .index attribute:
netflix.index

RangeIndex(start=0, stop=8807, step=1)

In [None]:
# As shown earlier by info(), the "date_added" column has the object type.
# To convert it to a datetime type, use the following function:

netflix['date_added'] = pd.to_datetime(netflix['date_added'])

In [None]:
# Checking the data type of the column
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            8807 non-null   object        
 1   type          8807 non-null   object        
 2   title         8807 non-null   object        
 3   director      6173 non-null   object        
 4   cast          7982 non-null   object        
 5   country       7976 non-null   object        
 6   date_added    8797 non-null   datetime64[ns]
 7   release_year  8807 non-null   int64         
 8   rating        8803 non-null   object        
 9   duration      8804 non-null   object        
 10  listed        8807 non-null   object        
 11  descrip       8807 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 825.8+ KB
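
Date columns in real files sometimes contain stray spaces or malformed entries. The sketch below shows one defensive way to convert such a column, using invented values: `str.strip()` removes surrounding whitespace, `format` states the expected layout, and `errors="coerce"` turns anything unparseable into `NaT` instead of raising an error. Whether the Netflix file needs this is an assumption, not something shown above.

```python
import pandas as pd

# Invented examples: a clean date, one with a leading space, a missing value, and junk.
raw_dates = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])

parsed = pd.to_datetime(
    raw_dates.str.strip(),   # remove leading/trailing whitespace first
    format="%B %d, %Y",      # full month name, day, four-digit year
    errors="coerce",         # unparseable values become NaT
)

print(parsed)
print(parsed.dt.year)        # the .dt accessor exposes date parts such as the year
```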


In [None]:
# In this case, the "id" and "descrip" columns won't be needed, so let's delete them using the 'drop' function.
# In pandas, axis=0 refers to rows (the default) and axis=1 refers to columns.
# 'inplace=True' means that the changes are applied to the original dataframe.
netflix.drop(['id','descrip'], axis=1, inplace=True)
netflix.head(2)


Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries
1,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries"


In [None]:
# Now we have a total of 10 columns.
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   type          8807 non-null   object        
 1   title         8807 non-null   object        
 2   director      6173 non-null   object        
 3   cast          7982 non-null   object        
 4   country       7976 non-null   object        
 5   date_added    8797 non-null   datetime64[ns]
 6   release_year  8807 non-null   int64         
 7   rating        8803 non-null   object        
 8   duration      8804 non-null   object        
 9   listed        8807 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 688.2+ KB
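
An equivalent way to remove columns is the `columns=` keyword, which avoids having to remember the `axis` number. This is a small sketch on an invented dataframe; `errors="ignore"` is optional and only skips labels that are not present.

```python
import pandas as pd

# Invented three-column table standing in for the real data.
sample = pd.DataFrame({"id": ["s1", "s2"], "title": ["A", "B"], "descrip": ["x", "y"]})

# Without inplace=True, drop() returns a new dataframe; 'sample' keeps all columns.
trimmed = sample.drop(columns=["id", "descrip"])

# 'id' is already gone here; errors="ignore" prevents a KeyError.
trimmed_again = trimmed.drop(columns=["id"], errors="ignore")

print(sample.shape)
print(trimmed.columns.tolist())
```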


In [None]:
# Next, it's important to understand which values are null and whether we should remove or replace them.
# As we can see, many director, cast and country values are missing.
# A few date_added, rating and duration values are missing too.
# Deleting every row with a missing value would discard a large part of the dataset (2634 rows lack a director alone), so it is not a good option here.
# Instead, we can replace these values.

netflix.isnull().sum()

type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed             0
dtype: int64

In [None]:
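# The fillna function used below replaces NaN with "Unknown".
# This way, the values are no longer classified as null.
# Assigning the result back to the column is more reliable than inplace=True on a single column in recent pandas versions.

netflix['director'] = netflix['director'].fillna("Unknown")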

In [None]:
# We can do the same thing with the other 'object' columns that contain NaN values.
netflix['cast'] = netflix['cast'].fillna("Unknown")
netflix['country'] = netflix['country'].fillna("Unknown")
netflix['rating'] = netflix['rating'].fillna("Unknown")
netflix['duration'] = netflix['duration'].fillna("Unknown")

In [None]:
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   type          8807 non-null   object        
 1   title         8807 non-null   object        
 2   director      8807 non-null   object        
 3   cast          8807 non-null   object        
 4   country       8807 non-null   object        
 5   date_added    8797 non-null   datetime64[ns]
 6   release_year  8807 non-null   int64         
 7   rating        8807 non-null   object        
 8   duration      8807 non-null   object        
 9   listed        8807 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 688.2+ KB
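
Several columns can also be filled in a single call by passing a dictionary to `fillna`. The sketch below uses an invented three-column table; columns not named in the dictionary (here `date_added`) keep their missing values.

```python
import pandas as pd

# Invented rows with gaps in different columns.
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None],
    "country": [None, "India"],
    "date_added": pd.to_datetime(["2021-09-25", None]),
})

# Map each column name to its replacement value.
filled = sample.fillna({"director": "Unknown", "country": "Unknown"})

print(filled)
print(filled.isnull().sum())
```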


In [None]:
# Now, let's look at which ratings are assigned to the Netflix movies and series,
# and how many distinct ratings there are.
netflix['rating'].nunique()

18

In [None]:
# All 18 distinct values are listed below.
# As we can see, some values don't belong in the rating column, for example '84 min'.
netflix['rating'].value_counts()

TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
Unknown        4
NC-17          3
UR             3
74 min         1
84 min         1
66 min         1
Name: rating, dtype: int64

In [None]:
# These three rows hold a duration in the rating column, so we drop them.
netflix.drop(netflix.index[netflix['rating'] == '74 min'], inplace = True)
netflix.drop(netflix.index[netflix['rating'] == '66 min'], inplace = True)
netflix.drop(netflix.index[netflix['rating'] == '84 min'], inplace = True)

In [None]:
# Checking 'Rating' column
netflix['rating'].value_counts()

TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
Unknown        4
NC-17          3
UR             3
Name: rating, dtype: int64
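
The same kind of clean-up can be written with a boolean mask instead of one `drop` call per value. The rule used here (a rating that ends in "min" is really a duration) is an assumption made for this sketch, and the rows are invented.

```python
import pandas as pd

# Invented rows: two valid ratings and two durations in the wrong column.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": ["PG", "74 min", "TV-MA", "84 min"],
})

# Hypothetical rule: values ending in "min" are misplaced durations.
looks_like_duration = sample["rating"].str.endswith("min")

# ~ inverts the mask, keeping only the rows with a real rating.
cleaned = sample[~looks_like_duration].reset_index(drop=True)

print(cleaned)
```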

In [None]:
# The nunique function shows 8804 different titles.
# The dataframe originally had 8807 rows, but three rows were dropped in the previous step, so 8804 remain.
# The two numbers match, which suggests that no title is duplicated.
netflix['title'].nunique()

8804

In [None]:
netflix['title'].value_counts()

Next in Fashion       1
Music Teacher         1
The Yard              1
The Defeated          1
Behzat Ç.             1
                     ..
Pizza, birra, faso    1
Tokyo Trial           1
Accomplice            1
Midnight Sun          1
Comedy High School    1
Name: title, Length: 8804, dtype: int64
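
To verify directly whether any title is repeated, `duplicated()` can flag the rows involved. This is a minimal sketch on an invented table in which one title appears twice.

```python
import pandas as pd

# Invented rows: "Zoom" appears twice.
sample = pd.DataFrame({
    "title": ["Zoom", "Zodiac", "Zoom"],
    "type": ["Movie", "Movie", "TV Show"],
})

# keep=False marks every occurrence of a repeated title, not only the later ones.
dup_mask = sample["title"].duplicated(keep=False)

# Rows minus distinct titles gives the number of surplus entries.
n_extra = len(sample) - sample["title"].nunique()

print(sample[dup_mask])
print(n_extra)
```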

In [None]:
# After dropping rows, the index labels no longer match the row count: 'Int64Index: 8804 entries, 0 to 8806'.
# Labels (used by loc) and positions (used by iloc) now disagree, so it is necessary to rebuild the index.
netflix.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8804 entries, 0 to 8806
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   type          8804 non-null   object        
 1   title         8804 non-null   object        
 2   director      8804 non-null   object        
 3   cast          8804 non-null   object        
 4   country       8804 non-null   object        
 5   date_added    8794 non-null   datetime64[ns]
 6   release_year  8804 non-null   int64         
 7   rating        8804 non-null   object        
 8   duration      8804 non-null   object        
 9   listed        8804 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 756.6+ KB


In [None]:
# reset_index(drop=True) rebuilds the index from 0 and discards the old labels, so the index is consecutive again.
netflix.reset_index(drop=True, inplace= True)

In [None]:
netflix.tail()

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed
8799,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,2019-11-20,2007,R,158 min,"Cult Movies, Dramas, Thrillers"
8800,TV Show,Zombie Dumb,Unknown,Unknown,Unknown,2019-07-01,2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies"
8801,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,2019-11-01,2009,R,88 min,"Comedies, Horror Movies"
8802,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,2020-01-11,2006,PG,88 min,"Children & Family Movies, Comedies"
8803,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,2019-03-02,2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals"


In [None]:
# To check the director of a row by position, use the 'iloc' indexer (row 8799, column 2).
# Because the index was reset above, this position matches the label shown by tail().
netflix.iloc[8799,2]

'David Fincher'
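
The difference between positions and labels is easiest to see on a small example. In this sketch (invented data), dropping a row leaves a gap in the labels: `iloc` still counts rows from 0, while `loc` looks up the original labels, and `reset_index(drop=True)` makes the two agree again.

```python
import pandas as pd

sample = pd.DataFrame({"title": ["A", "B", "C", "D"]})

# Remove the row labelled 1; the remaining labels are 0, 2, 3.
gapped = sample.drop(index=1)

by_position = gapped.iloc[1]["title"]   # second row by position
by_label = gapped.loc[2, "title"]       # row whose label is 2

# After the reset, labels run 0, 1, 2 and match the positions.
reset = gapped.reset_index(drop=True)

print(by_position, by_label)
print(reset)
```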

**Exploring the dataset's entertainment variety**

In [None]:
# To check what kinds of entertainment are in the dataframe:
netflix['type'].unique()

array(['Movie', 'TV Show'], dtype=object)

In [None]:
# To filter by Movie using the loc indexer:
Movies = netflix.loc[(netflix['type'] == 'Movie')]

In [None]:
Movies

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,2021-09-25,2020,PG-13,90 min,Documentaries
6,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",Unknown,2021-09-24,2021,PG,91 min,Children & Family Movies
7,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...",2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies"
9,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,2021-09-24,2021,PG-13,104 min,"Comedies, Dramas"
12,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...","Germany, Czech Republic",2021-09-23,2021,TV-MA,127 min,"Dramas, International Movies"
...,...,...,...,...,...,...,...,...,...,...
8798,Movie,Zinzana,Majid Al Ansari,"Ali Suliman, Saleh Bakri, Yasa, Ali Al-Jabri, ...","United Arab Emirates, Jordan",2016-03-09,2015,TV-MA,96 min,"Dramas, International Movies, Thrillers"
8799,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,2019-11-20,2007,R,158 min,"Cult Movies, Dramas, Thrillers"
8801,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,2019-11-01,2009,R,88 min,"Comedies, Horror Movies"
8802,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,2020-01-11,2006,PG,88 min,"Children & Family Movies, Comedies"


In [None]:
# To filter a specific movie
Zodiac_movie = netflix.loc[netflix['title'] == 'Zodiac']

In [None]:
# Here is the row for the Zodiac movie, with all its columns
Zodiac_movie

Unnamed: 0,type,title,director,cast,country,date_added,release_year,rating,duration,listed
8799,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,2019-11-20,2007,R,158 min,"Cult Movies, Dramas, Thrillers"
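
Filters can also combine several conditions. The sketch below, on invented rows, keeps movies released in 2020 or later; each condition needs its own parentheses, `&` means "and", and `.copy()` makes the result an independent dataframe that can be modified safely.

```python
import pandas as pd

# Invented rows for illustration.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "title": ["A", "B", "C", "D"],
    "release_year": [2007, 2018, 2020, 2021],
})

is_movie = sample["type"] == "Movie"
is_recent = sample["release_year"] >= 2020

# Combine the two boolean masks and take an independent copy of the result.
recent_movies = sample.loc[is_movie & is_recent].copy()

print(recent_movies)
```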


In [None]:
# Let's suppose we want to check which country produces the most movies,
# but first, it's necessary to check the data. As we can see, some movies are co-productions.
Movies['country'].tolist()

['United States',
 'Unknown',
 'United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia',
 'United States',
 'Germany, Czech Republic',
 'Unknown',
 'Unknown',
 'Unknown',
 'Unknown',
 'Unknown',
 'India',
 'Unknown',
 'United States',
 'United States',
 'United States, India, France',
 'Unknown',
 'Unknown',
 'Unknown',
 'China, Canada, United States',
 'United States',
 'United States',
 'United States',
 'United States',
 'Unknown',
 'South Africa, United States, Japan',
 'United States',
 'Japan',
 'Japan',
 'Japan',
 'Japan',
 'Japan',
 'Japan',
 'Japan',
 'Japan',
 'Japan',
 'Japan',
 'Japan',
 'Japan',
 'Unknown',
 'Unknown',
 'Unknown',
 'Nigeria',
 'Unknown',
 'Unknown',
 'Unknown',
 'Unknown',
 'Unknown',
 'United States',
 'Nigeria',
 'Unknown',
 'Unknown',
 'Spain, United States',
 'France',
 'Unknown',
 'United Kingdom, United States',
 'United States',
 'United States',
 'Unknown',
 'India',
 'United States',
 'Unknown',
 'Unknown',
 'India',
 'United Kingdo

In [None]:
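# Among single-country entries, the United States has the most movies, with 2055 productions.
# India is in second place, with 893 movies.
# Note that each co-production is counted as its own combined entry here, not under each participating country.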
Movies['country'].value_counts()

United States                           2055
India                                    893
Unknown                                  440
United Kingdom                           206
Canada                                   122
                                        ... 
Mauritius                                  1
United Kingdom, South Korea                1
Germany, France                            1
Mexico, Brazil                             1
Taiwan, China, France, United States       1
Name: country, Length: 652, dtype: int64
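
If co-productions should count toward each participating country, one possible approach is to split the comma-separated string and give every country its own row with `explode()`. This is a sketch on invented rows, not a result for the Netflix file.

```python
import pandas as pd

# Invented rows, including one co-production.
movies = pd.DataFrame({
    "country": ["United States", "United States, India, France", "India", "Unknown"],
})

per_country = (
    movies["country"]
    .str.split(",")     # "A, B" becomes the list ["A", " B"]
    .explode()          # one row per list element
    .str.strip()        # remove the space left after each comma
    .value_counts()     # count how often each country appears
)

print(per_country)
```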

In [None]:
# Finally, let's use the 'groupby' function to group the rows by the title and type columns.
title_type = netflix.groupby(['title', 'type'])

In [None]:
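# After grouping, we can aggregate and view the result as follows.
# Selecting 'release_year' first restricts sum() to the numeric column, which avoids errors on text and date columns in recent pandas versions.
title_type[['release_year']].sum().head(50)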

Unnamed: 0_level_0,Unnamed: 1_level_0,release_year
title,type,Unnamed: 2_level_1
#Alive,Movie,2020
#AnneFrank - Parallel Stories,Movie,2019
#FriendButMarried,Movie,2018
#FriendButMarried 2,Movie,2020
#Roxy,Movie,2018
#Rucker50,Movie,2016
#Selfie,Movie,2014
#Selfie 69,Movie,2016
#blackAF,TV Show,2020
#cats_the_mewvie,Movie,2020
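
Grouping is usually more informative when each group holds several rows. As a small sketch on invented data, grouping by `type` alone lets us count the titles of each kind and find the most recent release year per kind.

```python
import pandas as pd

# Invented rows for illustration.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2007, 2018, 2020, 2021],
})

# size() counts the rows in each group.
counts = sample.groupby("type").size()

# Selecting a column before aggregating applies max() to that column only.
latest = sample.groupby("type")["release_year"].max()

print(counts)
print(latest)
```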


**Conclusion**

This was a tutorial on how to use basic pandas functions. From this point, you can explore the data further and generate descriptive or predictive graphs to study the relationships between rows and/or columns. I hope it was helpful in some way.