In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import HTML
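# Show up to 30 columns when a DataFrame is displayed
pd.options.display.max_columns = 30
# Display floats with 2 significant digits ('{:.2f}' would give 2 decimal places instead)
pd.options.display.float_format = '{:.2}'.format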

In [18]:
df = pd.read_csv('movies_complete.csv', parse_dates=['release_date'])


The parse_dates parameter of pandas.read_csv() specifies which columns should be parsed as dates. By default, read_csv() does not parse dates, so a column such as release_date is read as plain strings. Passing parse_dates=['release_date'] converts it to datetime64[ns] (as the df.info() output confirms), which makes time-related operations such as sorting, filtering by date, and extracting the year much easier.

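It accepts several kinds of values:

* Omitted (the default): no columns are parsed as dates.
* True: pandas attempts to parse the index as dates (not every column).
* A list of column names or positions, such as ['release_date']: each listed column is parsed as a separate date column.

If a column cannot be parsed as dates, for example because it contains an unparsable value, it is returned unaltered with its original type (typically object).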
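
The effect of parse_dates can be sketched on a small in-memory CSV. The three rows below are a hypothetical sample that only mimics a few columns of movies_complete.csv; the names csv_text, raw, and parsed are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for movies_complete.csv.
csv_text = """id,title,release_date
862,Toy Story,1995-10-30
8844,Jumanji,1995-12-15
15602,Grumpier Old Men,1995-12-22
"""

# Without parse_dates, the date column is read as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the listed column is converted to a datetime dtype.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["release_date"])

print(raw["release_date"].dtype)
print(parsed["release_date"].dtype)

# Datetime columns expose the .dt accessor, e.g. to extract the year.
years = parsed["release_date"].dt.year.tolist()
print(years)
```

Once the column has a datetime dtype, the .dt accessor and date comparisons work directly, which is not the case for the string version.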

In [20]:
df.head()

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,370.0,Pixar Animation Studios,United States of America,5400.0,7.7,22.0,81.0,"Led by Woody, Andy's toys live happily in his ...",English,<img src='http://image.tmdb.org/t/p/w185//uXDf...,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,13,106,John Lasseter
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,260.0,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,2400.0,6.9,17.0,100.0,When siblings Judy and Peter discover an encha...,English|Français,<img src='http://image.tmdb.org/t/p/w185//vgpX...,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,26,16,Joe Johnston
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,United States of America,92.0,6.5,12.0,100.0,A family wedding reignites the ancient feud be...,English,<img src='http://image.tmdb.org/t/p/w185//1FSX...,Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...,7,4,Howard Deutch
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.0,Twentieth Century Fox Film Corporation,United States of America,34.0,6.1,3.9,130.0,"Cheated on, mistreated and stepped on, the wom...",English,<img src='http://image.tmdb.org/t/p/w185//4wjG...,Whitney Houston|Angela Bassett|Loretta Devine|...,10,10,Forest Whitaker
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,77.0,Sandollar Productions|Touchstone Pictures,United States of America,170.0,5.7,8.4,110.0,Just when George Banks has recovered from his ...,English,<img src='http://image.tmdb.org/t/p/w185//lf9R...,Steve Martin|Diane Keaton|Martin Short|Kimberl...,12,7,Charles Shyer


The df.head() method displays the first five rows of the DataFrame df. It is a quick way to check the structure and contents of the dataset right after loading it.

To display more (or fewer) rows, pass the number to head(); for example, df.head(10) shows the first 10 rows.
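
As a minimal sketch of how the row count argument behaves, the snippet below uses a hypothetical 20-row frame named toy rather than the movies DataFrame.

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself works on the movies DataFrame df.
toy = pd.DataFrame({"n": range(20)})

first_five = toy.head()            # default: first 5 rows
first_ten = toy.head(10)           # first 10 rows
all_but_last_three = toy.head(-3)  # negative n: every row except the last 3

print(len(first_five), len(first_ten), len(all_but_last_three))
```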

**DataFrame Information**

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44691 entries, 0 to 44690
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   id                     44691 non-null  int64         
 1   title                  44691 non-null  object        
 2   tagline                20284 non-null  object        
 3   release_date           44657 non-null  datetime64[ns]
 4   genres                 42586 non-null  object        
 5   belongs_to_collection  4463 non-null   object        
 6   original_language      44681 non-null  object        
 7   budget_musd            8854 non-null   float64       
 8   revenue_musd           7385 non-null   float64       
 9   production_companies   33356 non-null  object        
 10  production_countries   38835 non-null  object        
 11  vote_count             44691 non-null  float64       
 12  vote_average           42077 non-null  float64       
 13  p

The df.info() method prints a concise summary of the DataFrame:

* Entries: The total number of rows (44691 here) and the index range.
* Non-Null Count: The number of non-null values in each column.
* Dtype: The data type of each column (e.g., int64, float64, object, datetime64[ns]).
* Memory Usage: The approximate memory used by the DataFrame.

This makes it a quick way to check both data types and missing values. For example, budget_musd has only 8854 non-null values out of 44691 rows, so most movies have no budget recorded.
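
The non-null counts reported by info() can also be turned into explicit missing-value counts. The sketch below assumes a hypothetical three-row frame named toy, filled with the rounded budget and revenue values visible in the df.head() output.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row frame using rounded values from the df.head() output.
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "budget_musd": [30.0, 65.0, np.nan],
    "revenue_musd": [370.0, 260.0, np.nan],
})

# info() prints to stdout by default; passing buf captures the text instead.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# isna().sum() counts the missing values per column,
# the complement of the Non-Null Count shown by info().
missing = toy.isna().sum()
print(missing)
```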


**Accessing a Specific Column**

In [24]:
df['genres'][0]

'Animation|Comedy|Family'

In [26]:
df.genres[0]

'Animation|Comedy|Family'

In [28]:
df['cast'][0]

'Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wallace Shawn|John Ratzenberger|Annie Potts|John Morris|Erik von Detten|Laurie Metcalf|R. Lee Ermey|Sarah Freeman|Penn Jillette'
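
The cells above use both bracket access (df['genres']) and attribute access (df.genres). As a rough illustration of how the two relate, the sketch below uses a hypothetical two-row frame named toy; the column "cast size" (with a space) is invented here only to show a case where attribute access is not available.

```python
import pandas as pd

# Hypothetical two-row frame; "cast size" is an invented column name.
toy = pd.DataFrame({
    "genres": ["Animation|Comedy|Family", "Adventure|Fantasy|Family"],
    "cast size": [13, 26],
})

by_bracket = toy["genres"][0]   # bracket access works for any column name
by_attribute = toy.genres[0]    # attribute access needs a valid Python identifier
by_loc = toy.loc[0, "genres"]   # explicit label-based lookup of row 0

# A column name containing a space can only be reached with brackets.
sizes = toy["cast size"].tolist()

# The pipe-separated string can be split into a Python list.
genre_list = by_bracket.split("|")
print(genre_list)
```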

**Description of the DataFrame**

In [30]:
df.describe()

Unnamed: 0,id,release_date,budget_musd,revenue_musd,vote_count,vote_average,popularity,runtime,cast_size,crew_size
count,45000.0,44657,8900.0,7400.0,45000.0,42000.0,45000.0,43000.0,45000.0,45000.0
mean,110000.0,1992-04-28 16:30:02.539355520,22.0,69.0,110.0,6.0,3.0,98.0,12.0,10.0
min,2.0,1874-12-09 00:00:00,1e-06,1e-06,0.0,0.0,0.0,1.0,0.0,0.0
25%,26000.0,1978-08-12 00:00:00,2.0,2.4,3.0,5.3,0.4,86.0,6.0,2.0
50%,59000.0,2001-08-16 00:00:00,8.2,17.0,10.0,6.1,1.2,95.0,10.0,6.0
75%,150000.0,2010-12-10 00:00:00,25.0,68.0,35.0,6.8,3.8,110.0,15.0,12.0
max,470000.0,2017-12-27 00:00:00,380.0,2800.0,14000.0,10.0,550.0,1300.0,310.0,440.0
std,110000.0,,34.0,150.0,500.0,1.3,6.0,35.0,12.0,16.0


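The df.describe() method provides a statistical summary of the numeric columns in a DataFrame. In the output above, the datetime column release_date is summarized as well, although no standard deviation is reported for it. The summary includes: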

* count: The number of non-null values.
* mean: The average value.
* std: The standard deviation, showing the amount of variation.
* min: The minimum value.
* 25%: The 25th percentile value (first quartile).
* 50%: The 50th percentile value (median or second quartile).
* 75%: The 75th percentile value (third quartile).
* max: The maximum value.

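**Key Points**
* Summarized Columns: By default, text (object) columns such as title are excluded. Numeric columns are summarized, and in the output above so is the datetime column release_date. Other columns can be requested explicitly, e.g. with df.describe(include='all').
* Understanding Distribution: Helps in understanding the distribution, spread, and central tendency of your data.
* Outlier Detection: Comparing the min and max values with the quartiles can reveal potential outliers. For example, runtime has a maximum of 1300 while its 75th percentile is only 110.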
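
One common way to act on the quartiles is the 1.5 × IQR rule, sketched below. This rule is an illustrative choice, not something the notebook itself applies. The frame toy is hypothetical: its runtime values are the rounded figures from the outputs above (four from df.head() plus the reported maximum), and the titles are placeholders.

```python
import pandas as pd

# Hypothetical frame: rounded runtimes from the outputs above, placeholder titles.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "runtime": [81.0, 100.0, 100.0, 130.0, 1300.0],
})

numeric_summary = toy.describe()           # numeric columns only by default
full_summary = toy.describe(include="all")  # also summarizes the text column

# 1.5 * IQR rule (an assumption of this sketch, not the notebook's method).
q1 = toy["runtime"].quantile(0.25)
q3 = toy["runtime"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = toy[(toy["runtime"] < lower) | (toy["runtime"] > upper)]
print(outliers)
```

Rows flagged this way are only candidates for inspection; whether a value such as a very long runtime is an error or a genuine extreme has to be checked against the data itself.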
