# Introduction to Pandas
This notebook is designed to introduce you to the Python library, Pandas. In this notebook, you will learn how to:



*   Import the pandas library
*   Create a pandas series and dataframe
*   Perform basic exploratory data analysis
*   Read csv files into pandas
*   Write csv files from your notebook

## What is Pandas?

Pandas is a Python library for data analysis. It is geared towards **tabular** data, such as what you'd find in an Excel or Google Sheets spreadsheet. To use pandas, you must first import the library. The cell below also imports NumPy, which is used later to mark missing values with `np.nan`.

In [None]:
# Importing Pandas
import pandas as pd
import numpy as np

<module 'pandas' from '/usr/local/lib/python3.12/dist-packages/pandas/__init__.py'>

## Pandas Series and Dataframe

Pandas stores data in either a Series or a DataFrame. A Series is a single column of data, such as a list of the most popular television shows. A DataFrame holds multiple columns.

### Pandas Series

In [None]:
# create a pandas series
tv_shows = pd.Series(
    ["Breaking Bad",
     "The Sopranos",
     "Game of Thrones",
     "Looney Toons",
     "The Twilight Zone",
     "The Office",
     "Tom and Jerry",
     "Star Trek: The Original Series",
     "Band of Brothers",
     "Better Call Saul",
     "Seinfeld",
     "The X-Files",
     "M*A*S*H",
     "South Park",
     "Friends",
     "Scooby Doo",
     "Sherlock",
     "Chernobyl",
     "The Wire",
     "Stranger Things"
    ]
)

tv_shows

Unnamed: 0,0
0,Breaking Bad
1,The Sopranos
2,Game of Thrones
3,Looney Toons
4,The Twilight Zone
5,The Office
6,Tom and Jerry
7,Star Trek: The Original Series
8,Band of Brothers
9,Better Call Saul


In [None]:
# Add an index and name to the series
tv_shows = pd.Series(
    {1: "Breaking Bad",
     2: "The Sopranos",
     3: "Game of Thrones",
     4: "Looney Toons",
     5: "The Twilight Zone",
     6: "The Office",
     7: "Tom and Jerry",
     8: "Star Trek: The Original Series",
     9: "Band of Brothers",
     10: "Better Call Saul",
     11: "Seinfeld",
     12: "The X-Files",
     13: "M*A*S*H",
     14: "South Park",
     15: "Friends",
     16: "Scooby Doo",
     17: "Sherlock",
     18: "Chernobyl",
     19: "The Wire",
     20: "Stranger Things"
    }, name = "Top-Rated TV Shows"
)
tv_shows

Unnamed: 0,Top-Rated TV Shows
1,Breaking Bad
2,The Sopranos
3,Game of Thrones
4,Looney Toons
5,The Twilight Zone
6,The Office
7,Tom and Jerry
8,Star Trek: The Original Series
9,Band of Brothers
10,Better Call Saul
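
Once a Series has a custom index, a value can be looked up either by its label or by its position. The sketch below uses a shortened, three-show version of the series above (the name `top_three` is only for illustration) to contrast `.loc`, which is label-based, with `.iloc`, which is position-based.

```python
import pandas as pd

# A shortened version of the series above; `top_three` is an illustrative name
top_three = pd.Series(
    {1: "Breaking Bad", 2: "The Sopranos", 3: "Game of Thrones"},
    name="Top-Rated TV Shows",
)

# .loc looks up by index label: the label 1 belongs to the first entry here
by_label = top_three.loc[1]

# .iloc looks up by integer position, which always starts at 0
by_position = top_three.iloc[0]

print(by_label, "|", by_position)
print(top_three.index.tolist(), top_three.name)
```

Because the labels start at 1 while positions start at 0, `top_three.loc[1]` and `top_three.iloc[0]` refer to the same show.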


### Pandas Dataframe
To create a DataFrame, we can pass a dictionary to the pandas `DataFrame()` constructor. The keys become the column names, and each value is the list of entries for that column. Once the DataFrame is created, each column and each row can be accessed as its own Series.

In [None]:
tv_shows = pd.DataFrame(
    {'shows': ["Breaking Bad", "The Sopranos", "Game of Thrones", "Looney Toons", "The Twilight Zone", "The Office", "Tom and Jerry", "Star Trek: The Original Series", "Band of Brothers",
     "Better Call Saul", "Seinfeld", "The X-Files", "M*A*S*H", "South Park", "Friends", "Scooby Doo", "Sherlock", "Chernobyl", "The Wire", "Stranger Things"],
     'ranking': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
     'year_start': [2008, 1999, 2011, 1930, 1959, 2005, 1975, 1966, 2001, 2015, 1989, 1993, 1972, 1997, 1994, 1969, 2010, 2019, 2002, 2016],
     'year_end': [2013, 2007, 2019, 2014, 1964, 2013, 2021, 1969, 2001, 2022, 1998, 2018, 1983, 2025, 2004, 2025, 2017, 2019, 2008, 2025],
     'no_of_seasons': [5, 6, 8, np.nan, 5, 9, 16, 3, 1, 6, 9, 11, 11, 2, 10, 31, 4, 1, 5, 5],
     'no_of_episodes': [62, 86, 73, np.nan, 156, 201, 254, 79, 10, 63, 180, 218, 256, 338, 236, 465, 13, 5, 60, 38],
     'imdb_rating': [9.5, 9.2, 9.2, np.nan, 9.0, 9.0, np.nan, 8.4, 9.4, 9.0, 8.9, 8.6, 8.5, 8.7, 8.9, np.nan, 9.0, 9.3, 9.3, 8.6]
    }
)
tv_shows

Unnamed: 0,shows,ranking,year_start,year_end,no_of_seasons,no_of_episodes,imdb_rating
0,Breaking Bad,1,2008,2013,5.0,62.0,9.5
1,The Sopranos,2,1999,2007,6.0,86.0,9.2
2,Game of Thrones,3,2011,2019,8.0,73.0,9.2
3,Looney Toons,4,1930,2014,,,
4,The Twilight Zone,5,1959,1964,5.0,156.0,9.0
5,The Office,6,2005,2013,9.0,201.0,9.0
6,Tom and Jerry,7,1975,2021,16.0,254.0,
7,Star Trek: The Original Series,8,1966,1969,3.0,79.0,8.4
8,Band of Brothers,9,2001,2001,1.0,10.0,9.4
9,Better Call Saul,10,2015,2022,6.0,63.0,9.0
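
Notice that `no_of_episodes` is displayed as `62.0` rather than `62`. A likely explanation, sketched below on a toy column rather than the full table, is that `np.nan` is a floating-point value, so a column that mixes whole numbers with `np.nan` is stored as `float64`. The names `complete` and `with_gap` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Whole numbers only: pandas stores the column with an integer dtype
complete = pd.Series([62, 86, 73])

# One missing value: np.nan is a float, so the whole column becomes float64
with_gap = pd.Series([62, 86, np.nan])

print(complete.dtype, with_gap.dtype)

# isna() flags the missing entries, so summing it counts them
print(with_gap.isna().sum())
```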


In [None]:
print(tv_shows['shows'])
print(type(tv_shows['shows']))

0                       Breaking Bad
1                       The Sopranos
2                    Game of Thrones
3                       Looney Toons
4                  The Twilight Zone
5                         The Office
6                      Tom and Jerry
7     Star Trek: The Original Series
8                   Band of Brothers
9                   Better Call Saul
10                          Seinfeld
11                       The X-Files
12                           M*A*S*H
13                        South Park
14                           Friends
15                        Scooby Doo
16                          Sherlock
17                         Chernobyl
18                          The Wire
19                   Stranger Things
Name: shows, dtype: object
<class 'pandas.core.series.Series'>


In [None]:
print(tv_shows.iloc[0])
print(type(tv_shows.iloc[0]))

shows             Breaking Bad
ranking                      1
year_start                2008
year_end                  2013
no_of_seasons                5
no_of_episodes            62.0
imdb_rating                9.5
Name: 0, dtype: object
<class 'pandas.core.series.Series'>


## Exploratory Data Analysis (Dataframes)

In [None]:
tv_shows.head(10)

Unnamed: 0,shows,ranking,year_start,year_end,no_of_seasons,no_of_episodes,imdb_rating
0,Breaking Bad,1,2008,2013,5.0,62.0,9.5
1,The Sopranos,2,1999,2007,6.0,86.0,9.2
2,Game of Thrones,3,2011,2019,8.0,73.0,9.2
3,Looney Toons,4,1930,2014,,,
4,The Twilight Zone,5,1959,1964,5.0,156.0,9.0
5,The Office,6,2005,2013,9.0,201.0,9.0
6,Tom and Jerry,7,1975,2021,16.0,254.0,
7,Star Trek: The Original Series,8,1966,1969,3.0,79.0,8.4
8,Band of Brothers,9,2001,2001,1.0,10.0,9.4
9,Better Call Saul,10,2015,2022,6.0,63.0,9.0


In [None]:
tv_shows.shape

(20, 7)

In [None]:
tv_shows.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   shows           20 non-null     object 
 1   ranking         20 non-null     int64  
 2   year_start      20 non-null     int64  
 3   year_end        20 non-null     int64  
 4   no_of_seasons   20 non-null     object 
 5   no_of_episodes  19 non-null     float64
 6   imdb_rating     17 non-null     float64
dtypes: float64(2), int64(3), object(2)
memory usage: 1.2+ KB


In [None]:
tv_shows.describe()

Unnamed: 0,ranking,year_start,year_end,no_of_episodes,imdb_rating
count,20.0,20.0,20.0,19.0,17.0
mean,10.5,1991.5,2008.25,147.0,8.970588
std,5.91608,22.981686,17.752316,125.260617,0.325509
min,1.0,1930.0,1964.0,5.0,8.4
25%,5.75,1974.25,2003.25,61.0,8.7
50%,10.5,1998.0,2013.5,86.0,9.0
75%,15.25,2008.5,2019.5,227.0,9.2
max,20.0,2019.0,2025.0,465.0,9.5


In [None]:
tv_shows.columns

Index(['shows', 'ranking', 'year_start', 'year_end', 'no_of_seasons',
       'no_of_episodes', 'imdb_rating'],
      dtype='object')

# Index Method versus Class in Python

Tricky concept: the index! The pandas `Index` class is distinct from Python's `index()` method. The `Index` class belongs to the pandas library and holds the row or column labels of a Series or DataFrame. The `index()` method is built into Python sequences such as lists, tuples, and strings. It returns the position of the first occurrence of a value and is not tied to pandas.
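
To make the distinction concrete, here is a minimal sketch that puts the two side by side. The list `shows` and the labels `"a"`, `"b"`, `"c"` are made up for illustration.

```python
import pandas as pd

shows = ["Breaking Bad", "The Sopranos", "Breaking Bad"]

# list.index() is plain Python: it returns the position of the first match
# (and raises ValueError if the value is not in the list)
first_position = shows.index("Breaking Bad")

# A pandas Index is an object that holds the labels of a Series or DataFrame
series = pd.Series(shows, index=["a", "b", "c"])
labels = series.index

print(first_position)
print(type(labels).__name__, list(labels))

# The closest pandas counterpart to list.index() is Index.get_loc(),
# which returns the position of a label
print(labels.get_loc("b"))
```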

Before looking at the `Index` class more closely, we'll use the pandas method `set_index()` to make the `ranking` column the index of our DataFrame:

In [None]:
tv_shows.set_index('ranking', inplace=True)

In [None]:
tv_shows

## Reading and Writing Data in Pandas

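The cells below read the Metacritic movies dataset from Kaggle, saved locally as `metacritic_movies.csv`:

https://www.kaggle.com/datasets/mohamedasak/metacritic-movies-dataset?resource=download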

In [None]:
movies = pd.read_csv('metacritic_movies.csv')
movies

In [None]:
movies.head(10)

In [None]:
movies.describe()

In [None]:
movies.info()

In [None]:
movies['rating'].value_counts()

In [None]:
# Count missing values in each column
print(movies.isnull().sum())

# % missing values in each column
print((movies.isnull().sum() / len(movies)) * 100)

# The Index Class in Pandas

The pandas `Index` class provides an immutable array of labels for the axes of a Series or DataFrame. These labels are what make label-based indexing and automatic alignment possible.

You can also create an `Index` object directly and use it to label a Series or DataFrame:

In [None]:
idx = pd.Index(['a', 'b', 'c'])
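
The two properties mentioned above, immutability and alignment, can be seen in a short sketch. The Series `left` and `right` and their values are invented for this example.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# Index objects are immutable: assigning to a position raises TypeError
try:
    idx[0] = 'z'
except TypeError as err:
    print("Cannot modify an Index in place:", err)

# Labels drive alignment: values are matched by label, not by position
left = pd.Series([1, 2, 3], index=idx)
right = pd.Series([10, 20, 30], index=pd.Index(['c', 'b', 'a']))
total = left + right

print(total)
```

Even though `right` lists its labels in reverse order, the value for `'a'` in `left` is added to the value for `'a'` in `right`.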

# Exploratory Data Visualization in Pandas

In [None]:
tv_shows.to_csv('tv_shows.csv')
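
By default, `to_csv()` writes the index as the first column of the file, so the `ranking` index set earlier is saved along with the data. The sketch below shows a round trip on a two-row stand-in for `tv_shows` (the name `small` is illustrative), writing to an in-memory buffer instead of a file so that nothing is left on disk.

```python
import io
import pandas as pd

# A two-row stand-in for tv_shows, indexed by ranking
small = pd.DataFrame(
    {"shows": ["Breaking Bad", "The Sopranos"], "imdb_rating": [9.5, 9.2]},
    index=pd.Index([1, 2], name="ranking"),
)

# Write to an in-memory text buffer; the index becomes the first column
buffer = io.StringIO()
small.to_csv(buffer)
buffer.seek(0)

# index_col tells read_csv which column to turn back into the index
restored = pd.read_csv(buffer, index_col="ranking")
print(restored)
```

If you do not want the index in the file, pass `index=False` to `to_csv()`.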

In [None]:
import matplotlib.pyplot as plt
import pandas as pd


# Clean if necessary
tv_shows['shows'] = tv_shows['shows'].astype(str).str.replace('"', '').str.strip()
tv_shows['imdb_rating'] = pd.to_numeric(tv_shows['imdb_rating'], errors='coerce')

# Sum the IMDB rating per show (each show appears once, so the sum is just
# its rating; shows with a missing rating sum to 0)
shows_sum = tv_shows.groupby('shows')['imdb_rating'].sum().sort_values()

plt.figure(figsize=(10,6))
shows_sum.plot(kind='bar', color='tab:blue')
plt.xlabel('Shows')
plt.ylabel('IMDB Rating')
plt.title('IMDB Rating by Show')
plt.tight_layout()
plt.show()

In [None]:
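# Let pandas create the figure: a separate plt.figure() call would leave an
# empty extra figure, because boxplot(by=...) always draws on a new one
tv_shows.boxplot(column='imdb_rating', by='shows', figsize=(8, 6), rot=80)
plt.ylabel('IMDB Rating')
plt.title('IMDB Rating by Show')
plt.suptitle('')  # remove the automatic grouped-by title that pandas adds
plt.tight_layout()
plt.show()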

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

# Take a subset of the movies data for better visualization
sample_movies = movies.head(10).copy()

# Create a graph
G = nx.Graph()

# Add nodes and edges
for index, row in sample_movies.iterrows():
    movie_title = row['title']
    directors_str = row['director']

    # Add movie node
    G.add_node(movie_title, bipartite=0, type='movie')

    # Handle multiple directors if present (split by comma and strip whitespace)
    if pd.notna(directors_str):
        directors = [d.strip() for d in directors_str.split(',')]
        for director in directors:
            # Add director node
            G.add_node(director, bipartite=1, type='director')
            # Add edge between movie and director
            G.add_edge(movie_title, director)

# Separate nodes into two sets so movies and directors can be drawn in different colors
movie_nodes = [n for n, d in G.nodes(data=True) if d['bipartite'] == 0]
director_nodes = [n for n, d in G.nodes(data=True) if d['bipartite'] == 1]

# Draw the graph
plt.figure(figsize=(14, 10)) # Adjust figure size for better readability

# Use a spring layout for better node distribution
pos = nx.spring_layout(G, k=0.5, iterations=50)

nx.draw_networkx_nodes(G, pos, nodelist=movie_nodes, node_color='skyblue', label='Movies', node_size=1500, alpha=0.9)
nx.draw_networkx_nodes(G, pos, nodelist=director_nodes, node_color='lightcoral', label='Directors', node_size=1500, alpha=0.9)
nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.6, edge_color='gray')
nx.draw_networkx_labels(G, pos, font_size=9, font_weight='bold')

plt.title("Network of Movies and Directors (First 10 Movies)", size=15)
plt.legend(loc='best', fontsize=10)
plt.axis('off') # Hide axes
plt.tight_layout()
plt.show()

# Try it yourself!

In [None]:
# Take one of the visualizations and make it your own!
# 1) Change how many movies or TV shows appear in the graph.
# 2) How do you exclude a value that is missing? Hint: look at the IMDB rating.
# 3) Can you figure out a way to change the colors of the graphs?