In [93]:
import pandas as pd 
data=pd.read_csv("imdb.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Poster          10000 non-null  object 
 1   Title           10000 non-null  object 
 2   Year            9850 non-null   float64
 3   Certificate     7370 non-null   object 
 4   Duration (min)  9664 non-null   float64
 5   Genre           9993 non-null   object 
 6   Rating          9596 non-null   float64
 7   Metascore       7555 non-null   float64
 8   Director        9995 non-null   object 
 9   Cast            9961 non-null   object 
 10  Votes           9596 non-null   object 
 11  Description     10000 non-null  object 
 12  Review Count    9999 non-null   object 
 13  Review Title    9483 non-null   object 
 14  Review          9484 non-null   object 
dtypes: float64(4), object(11)
memory usage: 1.1+ MB


In [94]:
data.head(3)

Unnamed: 0,Poster,Title,Year,Certificate,Duration (min),Genre,Rating,Metascore,Director,Cast,Votes,Description,Review Count,Review Title,Review
0,https://m.media-amazon.com/images/M/MV5BYWRkZj...,The Idea of You,2023.0,R,115.0,"Comedy, Drama, Romance",6.4,67.0,Michael Showalter,"Anne Hathaway, Nicholas Galitzine, Ella Rubin,...",28744,"Solène, a 40-year-old single mom, begins an un...",166,Hypocrisy as an idea,"This film, as well as the reaction to it, is a..."
1,https://m.media-amazon.com/images/M/MV5BZGI4NT...,Kingdom of the Planet of the Apes,2023.0,PG-13,145.0,"Action, Adventure, Sci-Fi",7.3,66.0,Wes Ball,"Owen Teague, Freya Allan, Kevin Durand, Peter ...",22248,"Many years after the reign of Caesar, a young ...",183,A phenomenal start to another trilogy!,"I'm a big fan of all the planet of the apes, a..."
2,https://m.media-amazon.com/images/M/MV5BZjIyOT...,Unfrosted,2023.0,PG-13,97.0,"Biography, Comedy, History",5.5,42.0,Jerry Seinfeld,"Isaac Bae, Jerry Seinfeld, Chris Rickett, Rach...",18401,"In 1963 Michigan, business rivals Kellogg's an...",333,not funny,Pretty much the worst criticism you can lay on...


In [95]:
data=data.drop(['Poster','Review Title','Description','Review'],axis=1)

 By dropping these columns, the DataFrame is streamlined to focus on essential movie attributes, excluding non-essential information like poster images and full reviews.

**Dropping**:
The `data.drop()` method in Pandas is used to remove rows or columns from a DataFrame. It allows us to specify the labels (row or column names) to be dropped, along with the axis (0 for rows, 1 for columns) to perform the operation on.

- You can pass a single label or a list of labels to remove[1][2].
- By specifying `axis=0` (default), it removes rows based on the labels[1][2].
- By specifying `axis=1`, it removes columns based on the labels[1][2].
- Setting `inplace=True` modifies the original DataFrame, while `inplace=False` (default) returns a new DataFrame with the dropped rows/columns[1][2].
- If any of the specified labels don't exist, it raises an error by default. Setting `errors='ignore'` ignores the error and drops the rest of the valid labels[2].
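
The options listed above can be tried on a small frame. This is only a sketch: the column names mirror the dataset, but the values and the `NoSuchColumn` label are made up for illustration.

```python
import pandas as pd

# Toy frame; the values are invented for this example.
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Poster": ["p1", "p2", "p3"],
    "Rating": [6.4, 7.3, 5.5],
})

# Drop a column by label (axis=1); a new frame is returned and df is unchanged.
without_poster = df.drop(["Poster"], axis=1)

# Drop a row by index label (axis=0 is the default).
without_first_row = df.drop([0])

# errors='ignore' skips labels that do not exist instead of raising KeyError.
tolerant = df.drop(["Poster", "NoSuchColumn"], axis=1, errors="ignore")

print(list(without_poster.columns))
print(list(without_first_row.index))
print(list(tolerant.columns))
```

Because `inplace` defaults to `False`, the result has to be assigned back (as the cell above does with `data = data.drop(...)`) for the change to persist.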

In [96]:
data.head()

Unnamed: 0,Title,Year,Certificate,Duration (min),Genre,Rating,Metascore,Director,Cast,Votes,Review Count
0,The Idea of You,2023.0,R,115.0,"Comedy, Drama, Romance",6.4,67.0,Michael Showalter,"Anne Hathaway, Nicholas Galitzine, Ella Rubin,...",28744,166
1,Kingdom of the Planet of the Apes,2023.0,PG-13,145.0,"Action, Adventure, Sci-Fi",7.3,66.0,Wes Ball,"Owen Teague, Freya Allan, Kevin Durand, Peter ...",22248,183
2,Unfrosted,2023.0,PG-13,97.0,"Biography, Comedy, History",5.5,42.0,Jerry Seinfeld,"Isaac Bae, Jerry Seinfeld, Chris Rickett, Rach...",18401,333
3,The Fall Guy,2023.0,PG-13,126.0,"Action, Comedy, Drama",7.3,73.0,David Leitch,"Ryan Gosling, Emily Blunt, Aaron Taylor-Johnso...",38953,384
4,Challengers,2023.0,R,131.0,"Drama, Romance, Sport",7.7,82.0,Luca Guadagnino,"Zendaya, Mike Faist, Josh O'Connor, Darnell Ap...",32517,194


In [97]:
# Change the data type of a column to numeric
data['Votes'] = pd.to_numeric(data['Votes'], errors='coerce')
data['Review Count'] = pd.to_numeric(data['Review Count'], errors='coerce')

Converting these columns to numeric types lets us calculate statistics, filter on numeric conditions, and plot these columns.

**Changing datatypes of columns**:
1. We convert the 'Votes' column to numeric format using the `pd.to_numeric()` function from Pandas. The `errors='coerce'` parameter replaces any value that cannot be parsed as a number with NaN (Not a Number).

2. We convert the 'Review Count' column to numeric format in the same way.

Because `errors='coerce'` turns unparseable strings into NaN silently, it is worth comparing the non-null counts before and after. In the `data.info()` output that follows, 'Votes' falls from 9596 to 559 non-null values, so most of its entries were not plain numbers and were coerced to NaN.
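
The sketch below shows how `errors='coerce'` behaves on strings that are not bare numbers. The comma-formatted value is an assumption about what the raw 'Votes' column might contain; it has not been checked against `imdb.csv`.

```python
import pandas as pd

# Hypothetical raw values; the thousands separator is an assumption.
raw_votes = pd.Series(["28,744", "22248", "n/a", None])

# Plain conversion: anything that is not a bare number becomes NaN.
coerced = pd.to_numeric(raw_votes, errors="coerce")

# Removing the separators first keeps the comma-formatted entry.
cleaned = pd.to_numeric(
    raw_votes.str.replace(",", "", regex=False), errors="coerce"
)

print(coerced.notna().sum())   # how many values survived without cleaning
print(cleaned.notna().sum())   # how many values survived with cleaning
```

If the real column does use separators, cleaning the strings before conversion would preserve far more of the vote counts than coercion alone.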


In [98]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Title           10000 non-null  object 
 1   Year            9850 non-null   float64
 2   Certificate     7370 non-null   object 
 3   Duration (min)  9664 non-null   float64
 4   Genre           9993 non-null   object 
 5   Rating          9596 non-null   float64
 6   Metascore       7555 non-null   float64
 7   Director        9995 non-null   object 
 8   Cast            9961 non-null   object 
 9   Votes           559 non-null    float64
 10  Review Count    9565 non-null   float64
dtypes: float64(6), object(5)
memory usage: 859.5+ KB


In [99]:
len(data)

10000

In [100]:
data['Year'] = data['Year'].fillna(0).astype(int)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Title           10000 non-null  object 
 1   Year            10000 non-null  int32  
 2   Certificate     7370 non-null   object 
 3   Duration (min)  9664 non-null   float64
 4   Genre           9993 non-null   object 
 5   Rating          9596 non-null   float64
 6   Metascore       7555 non-null   float64
 7   Director        9995 non-null   object 
 8   Cast            9961 non-null   object 
 9   Votes           559 non-null    float64
 10  Review Count    9565 non-null   float64
dtypes: float64(5), int32(1), object(5)
memory usage: 820.4+ KB


**Filling Null Values:**

1. The 'Year' column is stored as `float64` because it contains missing values (NaN), and NaN cannot be represented in a plain integer column. To convert the column to integers we first have to fill or drop those missing values.

2. The `fillna()` method replaces missing values (NaN) with a specified value. Here, missing values in the 'Year' column are replaced with 0.

3. `astype(int)`: After filling the missing values, the `astype()` method converts the 'Year' column to integers, so the values are whole numbers without a decimal part.

Note that 0 is only a placeholder for "year unknown", so later steps that use 'Year' have to exclude it.
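
As an alternative to the 0 placeholder, pandas offers a nullable integer dtype. The sketch below compares the two approaches on made-up years; it is not what the notebook does, only an option to consider.

```python
import pandas as pd

# Made-up years with one missing value.
years = pd.Series([2023.0, 1994.0, None])

# Approach used in the notebook: missing years become the sentinel 0.
filled = years.fillna(0).astype(int)

# Alternative: the nullable "Int64" dtype keeps the gap as <NA>.
nullable = years.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
print(nullable.min())  # missing values are skipped, no sentinel to filter out
```

With the nullable dtype, aggregations such as `min()` skip missing entries, so there is no need to filter out zeros afterwards.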

In [101]:
import numpy as np

# Ignore the placeholder year 0 that stands for a missing value
df_non_zero_years = data[data['Year'] != 0]
min_year = np.nanmin(df_non_zero_years['Year'])
oldest_movies = data[data['Year'] == min_year]
print(len(oldest_movies))
oldest_movies.head(4)

50


Unnamed: 0,Title,Year,Certificate,Duration (min),Genre,Rating,Metascore,Director,Cast,Votes,Review Count
9900,Dag II,1929,,135.0,"Action, Drama, War",8.2,,Alper Caglar,"Caglar Ertugrul, Ufuk Bayraktar, Ahu Türkpençe...",,141.0
9901,Everybody's Fine,1929,PG-13,100.0,"Adventure, Drama",7.1,47.0,Kirk Jones,"Robert De Niro, Kate Beckinsale, Sam Rockwell,...",,111.0
9902,The Tinder Swindler,1929,16,114.0,"Documentary, Crime",7.1,,Felicity Morris,"Simon Leviev, Cecilie Fjellhøy, Ayleen Charlot...",,279.0
9903,Dark Matter,1929,R,88.0,Drama,6.0,49.0,Shi-Zheng Chen,"Ye Liu, Aidan Quinn, Meryl Streep, Peng Chi",,23.0


**Finding old movies based on year:**

1. **Filtering non-zero years**: We create `df_non_zero_years` by filtering `data` to keep only rows where the 'Year' column is not equal to 0.

2. **Finding the minimum year**: We use `np.nanmin()` to find the minimum year in `df_non_zero_years`, ignoring any NaN (Not a Number) values, and store it in the `min_year` variable.

- 50 movies are listed with the year 1929, the earliest year in the data.
- Several of these rows, such as The Tinder Swindler and Everybody's Fine, are clearly much more recent films, so the 'Year' values in this part of the data look unreliable. This result does not give us useful information by itself, but it shows that the dataset still needs more cleaning before analysis.
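
The two steps above can also be written with pandas alone. A minimal sketch on a made-up frame, where 0 stands for a missing year as in the notebook:

```python
import pandas as pd

# Toy data; 0 is the placeholder for an unknown year.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [0, 1929, 2023, 1929],
})

# Select the 'Year' values of rows with a known year, then take the minimum.
known_years = df.loc[df["Year"] != 0, "Year"]
min_year = known_years.min()

# All rows that share the earliest known year.
oldest = df[df["Year"] == min_year]
print(min_year, len(oldest))
```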

In [102]:
data.isnull().sum()

Title                0
Year                 0
Certificate       2630
Duration (min)     336
Genre                7
Rating             404
Metascore         2445
Director             5
Cast                39
Votes             9441
Review Count       435
dtype: int64

In [104]:
data = data.dropna(subset=['Certificate'])
data.isnull().sum() 

Title                0
Year                 0
Certificate          0
Duration (min)      31
Genre                0
Rating              58
Metascore          890
Director             0
Cast                 1
Votes             7267
Review Count       434
dtype: int64

The certificate is the age classification of each movie. Movies without a certificate are uncategorized and may not have many ratings, so we do not need them for our analysis and remove their rows entirely.

**Dropping Null Values**:
- `subset=['Certificate']`: This parameter specifies that we only want to check for missing values in the 'Certificate' column. The function will drop any rows where the value in the 'Certificate' column is missing (NaN).
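
A small illustration of how `subset` limits the check to one column. The frame is made up; only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two different columns.
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Certificate": ["R", np.nan, "PG-13"],
    "Metascore": [67.0, 50.0, np.nan],
})

# Only rows with a missing 'Certificate' are removed.
kept = df.dropna(subset=["Certificate"])

print(kept["Title"].tolist())
print(kept["Metascore"].isna().sum())  # gaps in other columns are left alone
```

Row C survives even though its Metascore is missing, which matches the outputs above where 'Metascore' still has nulls after the drop.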

In [105]:
data = data.dropna(subset=['Cast'])

Our analysis also needs the directors and actors of each movie, so we drop the movies that have no cast listed.

In [106]:
import numpy as np

mean_duration = np.mean(data['Duration (min)'].dropna())
mean_votes = np.mean(data['Votes'].dropna())
mean_review_count = np.mean(data['Review Count'].dropna())
mean_rating = np.mean(data['Rating'].dropna())

print("Mean Duration (min):", mean_duration)
print("Mean Votes:", mean_votes)
print("Mean Review Count:", mean_review_count)
print("Mean Rating:", mean_rating)


Mean Duration (min): 109.75551921504497
Mean Votes: 391.18446601941747
Mean Review Count: 244.44931506849315
Mean Rating: 6.48551497743126


**Finding central tendency**:

The `np.mean()` function calculates the average (mean) of a set of values. Here it is used to find the mean of the 'Duration (min)', 'Votes', 'Review Count', and 'Rating' columns in the `data` DataFrame.
NumPy has no `np.mode()`; for a fuller picture we can also use `np.median()` or the pandas methods `.median()` and `.mode()`.

- We calculated these means to learn the following:
  - The average rating is about 6.5, which is fairly low, so many of these movies received low ratings.
  - The other mean values are also comparatively low, so we need to selectively drop some movies.
  - The data is large and has many null values. A movie with a null rating may still have valid values for duration and other columns, so instead of dropping such rows we fill the nulls with the column means to avoid losing valuable data.
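
A minimal sketch of the other measures of central tendency, using pandas' `median()` and `mode()` on made-up ratings. pandas skips NaN by default, so no explicit `dropna()` is needed here.

```python
import numpy as np
import pandas as pd

# Made-up ratings with one missing value.
ratings = pd.Series([6.0, 6.0, 7.0, 9.0, np.nan])

mean_rating = ratings.mean()            # NaN is skipped by default
median_rating = ratings.median()        # middle of the sorted values
mode_rating = ratings.mode().iloc[0]    # mode() returns a Series; take the first

print(mean_rating, median_rating, mode_rating)
```

The median is less sensitive to a few extreme values than the mean, which can matter for skewed columns such as vote or review counts.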

In [107]:
data['Duration (min)'] = data['Duration (min)'].fillna(mean_duration)
data['Votes'] = data['Votes'].fillna(mean_votes)
data['Review Count'] = data['Review Count'].fillna(mean_review_count)
data['Rating'] = data['Rating'].fillna(mean_rating)
data.isnull().sum()

Title               0
Year                0
Certificate         0
Duration (min)      0
Genre               0
Rating              0
Metascore         890
Director            0
Cast                0
Votes               0
Review Count        0
dtype: int64

**Filling null values with mean**:
- Previously we calculated the means to analyze the data.
- We now use those means to fill the null values.
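
The same fill can be done for several columns in one step, since `fillna()` accepts a Series of per-column values. This is a sketch on invented data, not the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Toy frame; values are made up for the example.
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Duration (min)": [100.0, np.nan, 120.0],
    "Rating": [6.0, 8.0, np.nan],
})

numeric_cols = ["Duration (min)", "Rating"]

# mean() returns one value per column, indexed by column name.
column_means = df[numeric_cols].mean()

# fillna() matches those values to the columns by name.
df[numeric_cols] = df[numeric_cols].fillna(column_means)

print(df)
```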

In [108]:
best_mask = np.where(data['Rating'] > 7, True, False)
good_mask = np.where((data['Rating'] >= 5) & (data['Rating'] <= 7), True, False)
worse_mask = np.where(data['Rating'] < 5, True, False)


# **Masking**:
Masking is usually done with a boolean comparison directly on the column (for example `data['Rating'] > 7`); here we used `np.where()` from NumPy to build the same kind of mask. The pandas `mask()` method is a different tool: it replaces values where a condition is true.

**Mask**:
- A mask is an array of True/False (or 1/0) elements.
- For example, the mask for movies with rating > 7 has one element per row, and each element says whether that movie has a rating above 7.

- A boolean mask is much smaller than a filtered copy of the data, so we can inspect a selection without creating new DataFrames.
- For example, here we split the movies into three groups: best, good, and worse. If the worse group is very small, its rows may still be useful and we would not want to lose them, so we check the group sizes with the masks before acting.

1. **Creating Mask Conditions**:
   - The code uses `np.where()` to create boolean masks based on the 'Rating' column in the DataFrame `data`.
2. **Counting Movies in Each Category**:
   - The code uses `np.sum()` to count the number of True values in each mask.
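
For comparison, the masks can be built without `np.where()`, because a comparison on a Series already yields booleans. A minimal sketch on made-up ratings, using the same thresholds as the notebook:

```python
import pandas as pd

# Made-up ratings.
ratings = pd.Series([7.7, 6.4, 4.2, 7.0, 5.0])

# Each comparison returns a boolean Series that can be used as a mask.
best_mask = ratings > 7
good_mask = (ratings >= 5) & (ratings <= 7)
worse_mask = ratings < 5

# True counts as 1, so sum() gives the number of rows in each group.
counts = (int(best_mask.sum()), int(good_mask.sum()), int(worse_mask.sum()))

# Masks combine with | (or) and & (and) before indexing.
selected = ratings[best_mask | good_mask]

print(counts, len(selected))
```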

In [109]:
# Print the count of each category
print("Best Movies (Rating > 7):", np.sum(best_mask))
print("Good Movies (5 <= Rating <= 7):", np.sum(good_mask))
print("Worse Movies (Rating < 5):", np.sum(worse_mask))


Best Movies (Rating > 7): 2215
Good Movies (5 <= Rating <= 7): 4653
Worse Movies (Rating < 5): 501


The best group is smaller than the good group, and the worse group is very small. Taking only the best movies would leave too little data to analyze well, so we keep both the best and good movies for our analysis.

In [20]:
best_movies = data[best_mask|good_mask]
len(best_movies)

6868

In [22]:
best_movies.isnull().sum()

Title               0
Year                0
Certificate         0
Duration (min)      0
Genre               0
Rating              0
Metascore         726
Director            0
Cast                0
Votes               0
Review Count        0
dtype: int64

We do not need the Metascore because it is the critic score of the movie, and the rating is enough for our analysis, so we drop it.

In [24]:
best_movies=best_movies.drop(['Metascore'],axis=1)


In [28]:
# Calculate the average rating
avg_rating = best_movies['Rating'].mean()

# Calculate the average review count
avg_review_count = best_movies['Review Count'].mean()

print("Average Ratings:", avg_rating)
print("Average Review Counts:", avg_review_count)

Average Ratings: 6.649732071737161
Average Review Counts: 248.3912287280304


Compared with the means of these columns before filtering, the ratings in `best_movies` are slightly higher.

**Why we are analyzing this data**:

Any dataset is analyzed to find useful information in it. From this data we can find many useful things, for example:
- finding the best movies to watch.
- making recommendations based on users' needs, such as favourite genres or movie types.
- finding which directors are popular.
- finding the best old movies based on ratings.
- finding movies that are a waste of time to watch.
- finding the best love stories, action movies, dramas, etc.
- if we are opening a movie broadcasting channel, choosing better movies to play to gain subscribers faster.
- predicting which types of movies directors should make next year based on audience preferences.

and more.

In this analysis, imagine we have money to make a movie this year. To get the maximum profit from our movie, we need to choose the best of everything by analyzing this data.

In [111]:

genre_ratings = best_movies.groupby('Genre')['Rating'].mean().sort_values(ascending=False)
genre_ratings.head(3)


Genre
Documentary, Biography, Sport    8.50
Animation, Drama, War            8.50
Drama, Mystery, War              8.35
Name: Rating, dtype: float64

**Finding best genres from best movies only:**

1. **Grouping by Genre and Calculating Mean Ratings**:
   - We group the `best_movies` DataFrame by the 'Genre' column using `groupby('Genre')`. Each distinct genre combination, such as "Animation, Drama, War", forms its own group.
   - For each group, we calculate the mean of the 'Rating' column using `['Rating'].mean()`.
2. **Sorting**:
   - The `genre_ratings` Series is sorted in descending order of rating using `sort_values(ascending=False)`.

From the above we can choose the best genre to make a movie for better profits. We choose Animation, Drama, War because it will attract children too.
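
A mean rating can be driven by a single movie when a genre combination is rare. The sketch below adds a movie count next to the mean so that small groups can be filtered out. The data and the `min_movies` threshold are hypothetical.

```python
import pandas as pd

# Made-up movies; one genre combination appears only once.
movies = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Drama",
              "Animation, Drama, War", "Comedy", "Comedy"],
    "Rating": [7.0, 8.0, 7.5, 8.5, 6.0, 6.5],
})

# Mean rating and number of movies per genre combination.
genre_stats = (
    movies.groupby("Genre")["Rating"]
    .agg(mean_rating="mean", n_movies="count")
    .sort_values("mean_rating", ascending=False)
)

# Hypothetical threshold: keep only combinations with enough movies.
min_movies = 2
reliable = genre_stats[genre_stats["n_movies"] >= min_movies]

print(genre_stats)
print(reliable)
```

Checking the count before choosing a genre helps tell a consistently well-rated genre apart from one highly rated title.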

In [112]:
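from collections import Counter

# Collect every actor name from the comma-separated 'Cast' strings
all_actors = []
for index, row in best_movies.iterrows():
    actors = [actor.strip() for actor in row['Cast'].split(',')]
    all_actors.extend(actors)
actors_df = pd.DataFrame(all_actors, columns=['Actor'])
# Count each name once up front instead of rescanning the list for every row
actor_counts = Counter(all_actors)
actors_df['Number of Movies'] = actors_df['Actor'].map(actor_counts)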

In our data each movie lists its actors in the 'Cast' column, with the names separated by commas.
1. **Extracting Actors from Best Movies**:
   - The code iterates over the rows of the `best_movies` DataFrame using `iterrows()`.
   - For each row, it extracts the actors from the 'Cast' column by splitting the string at commas and stripping any leading or trailing whitespace.
   - The extracted actors are added to the `all_actors` list.
   - `actors_df` is then built from this list, with one row per actor appearance.
2. **Counting Movies for Each Actor**:
   - A new column 'Number of Movies' is added to `actors_df`.
   - `map()` assigns each name in the 'Actor' column the number of times it occurs in `all_actors`, which is the number of movies that actor appears in.

Later this DataFrame will be used to find the best actors to pick for our movie.
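
The same counts can be obtained without an explicit loop by splitting and exploding the 'Cast' column. This is an alternative sketch on invented names, not the notebook's method.

```python
import pandas as pd

# Made-up cast strings, including one without a space after the comma.
movies = pd.DataFrame({
    "Title": ["M1", "M2", "M3"],
    "Cast": ["Ann Lee, Bob Ray", "Bob Ray, Cy Dunn", "Bob Ray,Ann Lee"],
})

# split() gives a list per movie, explode() gives one row per actor,
# and strip() removes the whitespace left around each name.
actors = movies["Cast"].str.split(",").explode().str.strip()

# value_counts() counts appearances, sorted from most to fewest.
movie_counts = actors.value_counts()

print(movie_counts)
```

`value_counts()` already returns one row per unique actor, so no later `drop_duplicates()` step would be needed with this approach.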

In [113]:
# Calculate the average ratings for each actor
actor_avg_ratings = []
for actor in actors_df['Actor']:
    # Filter the movies where the actor's name appears in the cast string
    # (regex=False matches the name literally rather than as a pattern)
    actor_movies = best_movies[best_movies['Cast'].str.contains(actor, regex=False)]
    # Calculate the average rating for the actor
    avg_rating = actor_movies['Rating'].mean()
    actor_avg_ratings.append(avg_rating)

# Add the average ratings to the actors_df DataFrame
actors_df['Average Rating'] = actor_avg_ratings
# Drop duplicates to keep unique actors
actors_df = actors_df.drop_duplicates().reset_index(drop=True)

To find the average rating of an actor we need the ratings of that actor's movies:
- For each actor stored in `actors_df`, we select the movies whose cast contains the actor's name, take the mean of their ratings, and store it in the 'Average Rating' column.
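
Per-actor averages can also be computed with one `groupby` after giving each actor their own row. A sketch under made-up data, offered as an alternative to the substring search above:

```python
import pandas as pd

# Made-up movies with comma-separated casts.
movies = pd.DataFrame({
    "Cast": ["Ann Lee, Bob Ray", "Bob Ray, Cy Dunn", "Ann Lee"],
    "Rating": [8.0, 6.0, 7.0],
})

# One row per (movie, actor) pair; the rating is repeated for each actor.
long = (
    movies.assign(Actor=movies["Cast"].str.split(","))
    .explode("Actor")
    .reset_index(drop=True)
)
long["Actor"] = long["Actor"].str.strip()

# Movie count and mean rating per actor.
actor_stats = (
    long.groupby("Actor")["Rating"]
    .agg(n_movies="count", avg_rating="mean")
    .sort_values("n_movies", ascending=False)
)

print(actor_stats)
```

Matching on whole names after the split avoids the case where one actor's name is a substring of another's, which a `str.contains` search would count for both.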


In [114]:

actors_df = actors_df.sort_values(by='Number of Movies', ascending=False)

actors_df.head(10)

Unnamed: 0,Actor,Number of Movies,Average Rating
289,Robert De Niro,63,6.983897
93,Nicolas Cage,60,6.374759
387,Tom Hanks,52,7.042029
155,Samuel L. Jackson,49,6.673469
111,Mark Wahlberg,47,6.469905
100,Liam Neeson,46,6.519565
110,Matt Damon,45,7.070789
310,Bruce Willis,44,6.672727
1001,Nicole Kidman,43,6.550489
792,Johnny Depp,43,6.890698


From the actors above we can choose several top actors who have a lot of acting experience and also have good movie ratings. With this cast, the audience is more likely to be interested in the movie.

In [58]:

director_first_year = {}
for index, row in best_movies.iterrows():
    director = row['Director']
    year = row['Year']
    if director in director_first_year:
        if year != 0 and (year < director_first_year[director] or director_first_year[director] == 0):
            director_first_year[director] = year
    else:
        director_first_year[director] = year
directors = pd.DataFrame({'Director': list(director_first_year.keys()),
                                     'First Movie Year': list(director_first_year.values())})

Now we have to choose a director.
- The director must have experience in making movies, which means a large number of directed movies.
- The director's movies should also have high ratings, so that viewers will be interested in watching.
- The director's time in the movie industry is also important, so we record the year of the first movie.

**Updating Director's First Movie Year**:
Missing values in the 'Year' column were replaced with 0, so a plain minimum would give many directors a first movie year of zero. To avoid this we take the smallest non-zero year.
   - If the director is already in `director_first_year`:
     - If the current year is non-zero and earlier than the stored year for that director, or if the stored year is zero, the year is updated.
   - If the director is not in the dictionary, the year is added for that director.


This code finds the earliest recorded year for each director in the `best_movies` dataset. The resulting DataFrame `directors` can be used for further analysis and visualization, keeping in mind that the result is only as reliable as the 'Year' column.
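
The update rule described above amounts to "minimum non-zero year per director, or 0 if none is known". A sketch of that rule with `groupby`, on invented data:

```python
import pandas as pd

# Made-up movies; 0 marks an unknown year, as in the notebook.
movies = pd.DataFrame({
    "Director": ["D1", "D1", "D2", "D2", "D3"],
    "Year": [0, 1999, 2005, 2001, 0],
})

# Earliest known (non-zero) year per director.
known = movies[movies["Year"] != 0]
first_year = known.groupby("Director")["Year"].min()

# Directors with no known year fall back to 0, matching the loop's behaviour.
first_year = first_year.reindex(movies["Director"].unique(), fill_value=0)

print(first_year)
```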

In [59]:
directors['Number of Movies'] = directors['Director'].map(best_movies['Director'].value_counts())
director_avg_ratings = best_movies.groupby('Director')['Rating'].mean().to_dict()
directors['Average Rating'] = directors['Director'].map(director_avg_ratings)


We calculate the number of movies for each director and the average rating of each director's movies.

In [128]:
directors = directors.sort_values(by='Number of Movies', ascending=False)
print(len(directors))
directors.head(10)


2952


Unnamed: 0,Director,First Movie Year,Number of Movies,Average Rating
501,Clint Eastwood,1964,34,7.020162
92,Steven Spielberg,1964,33,7.387879
74,Ridley Scott,1933,27,7.014815
207,Ron Howard,1929,23,6.956522
326,Steven Soderbergh,1934,23,6.569565
72,Martin Scorsese,1985,22,7.677273
96,Robert Zemeckis,1965,21,6.994548
421,Woody Allen,1934,19,6.973684
28,Tim Burton,1964,19,7.005263
647,David Cronenberg,1933,18,6.705556


The results above are sorted by number of movies.
- From this data we can decide which director to pick to get the best movie.
- For example, Clint Eastwood has directed many movies, has a first recorded year of 1964, and has a high average rating.

In [49]:
person_name = "Clint Eastwood"  
person_movies = data[data['Cast'].str.contains(person_name) | data['Director'].str.contains(person_name)]

len(person_movies)

51

Clint Eastwood has also acted in many movies, so in total he appears in 51 movies in our data as actor or director. It would therefore be best to take him as the director for our movie.