# Summary of the Lecture on pandas and DataFrames
Introduction to pandas – pandas is a Python package for data manipulation and visualization, built on top of NumPy and Matplotlib. It is widely used in the data science community.

Course Overview – The course covers DataFrames, data aggregation, slicing/indexing, data visualization, handling missing data, and reading data into DataFrames.

Understanding DataFrames – pandas is designed for working with rectangular/tabular data, where each row represents an observation and each column represents a variable. Similar concepts exist in R (DataFrames) and SQL (tables).

Exploring a DataFrame –

.head() – Displays the first few rows of a dataset.
print(dogs.head())


.info() – Shows column names, data types, and non-null counts, from which missing values can be read off. It prints its report directly, so no print() is needed.
dogs.info()


.shape – Provides the number of rows and columns as a tuple.
print(dogs.shape)


.describe() – Computes summary statistics for numeric columns.
print(dogs.describe())
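
As a minimal sketch of the four inspection tools above, the snippet below builds a tiny stand-in for the lecture's dogs DataFrame. The column names and values are invented for illustration and are not the lecture's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's dogs DataFrame (values invented)
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [25.0, 23.0, 22.0, 17.0],
})

first_rows = dogs.head(2)     # first two rows only
n_rows, n_cols = dogs.shape   # shape is an attribute, so no parentheses
stats = dogs.describe()       # summarizes numeric columns by default

dogs.info()                   # prints its report itself and returns None
print(first_rows)
print(n_rows, n_cols)
print(stats.loc["mean"])
```

Note that .describe() skips the text columns here, which is why only height_cm and weight_kg appear in its result.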


DataFrame Components –

.values – Contains the data in a 2D NumPy array.
print(dogs.values)

.columns – Stores column names.
print(dogs.columns)


.index – Stores row labels or numbers.
print(dogs.index)
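
The three components can be inspected on any small DataFrame. The sketch below assumes a two-row, made-up dogs table, with column names that are illustrative only.

```python
import pandas as pd

# Hypothetical two-row DataFrame used only to show the three components
dogs = pd.DataFrame({"name": ["Bella", "Charlie"], "weight_kg": [25.0, 23.0]})

values = dogs.values     # 2D NumPy array holding the data
columns = dogs.columns   # Index object with the column labels
index = dogs.index       # RangeIndex with the row labels 0 and 1

print(values)
print(columns)
print(index)
```

Because the columns mix text and numbers, the array returned by .values has the generic object dtype.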


pandas Philosophy – Unlike Python’s "Zen of Python," which suggests one obvious way to solve a problem, pandas offers multiple approaches, making it flexible but sometimes harder to learn. The course focuses on essential methods for efficiency.

Hands-on Practice – Encouragement to start coding and applying pandas methods.

In [1]:
#import data

import pandas as pd
netflix = pd.read_csv(r"C:\Users\Nikhil Patil\Downloads\Datasets\netflix_titles.csv")

In [2]:
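# Inspecting the DataFrame

# First row
print(netflix.head(1))

# Column names, dtypes, and non-null counts (info() prints its own report)
netflix.info()

# Shape as (rows, columns)
print(netflix.shape)

# Summary statistics for numeric columns
print(netflix.describe())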

  show_id   type                 title         director cast        country  \
0      s1  Movie  Dick Johnson Is Dead  Kirsten Johnson  NaN  United States   

           date_added  release_year rating duration      listed_in  \
0  September 25, 2021          2020  PG-13   90 min  Documentaries   

                                         description  
0  As her father nears the end of his life, filmm...  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      88

In [3]:
#Parts of DF

#Values
print(netflix.values)

#columns
print(netflix.columns)

#index
print(netflix.index)

[['s1' 'Movie' 'Dick Johnson Is Dead' ... '90 min' 'Documentaries'
  'As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.']
 ['s2' 'TV Show' 'Blood & Water' ... '2 Seasons'
  'International TV Shows, TV Dramas, TV Mysteries'
  'After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.']
 ['s3' 'TV Show' 'Ganglands' ... '1 Season'
  'Crime TV Shows, International TV Shows, TV Action & Adventure'
  'To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.']
 ...
 ['s8805' 'Movie' 'Zombieland' ... '88 min' 'Comedies, Horror Movies'
  'Looking to survive in a world taken over by zombies, a dorky college student teams with an urban roughneck and a pair of grifter sisters.']
 ['s8806' 'Movie' 'Zoom' ... '88 min'
  '

# Sorting and Subsetting in pandas
Introduction – Sorting and subsetting are two fundamental ways to extract meaningful insights from a DataFrame.

*Sorting Data –*

Use .sort_values() to reorder rows based on a column (e.g., sorting dogs by weight).
Set ascending=False to sort in descending order.

Sort by multiple columns by passing a list of column names. You can also specify different sorting orders for each column.
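
A small sketch of the sorting options described above, using a made-up dogs table (names and measurements are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: Charlie and Lucy tie on weight so the second sort key matters
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "weight_kg": [25, 23, 23, 17],
    "height_cm": [56, 43, 46, 49],
})

# Single column, heaviest first
heaviest_first = dogs.sort_values("weight_kg", ascending=False)

# Two columns: weight ascending, then height descending within equal weights
by_weight_then_height = dogs.sort_values(
    ["weight_kg", "height_cm"], ascending=[True, False]
)

print(heaviest_first)
print(by_weight_then_height)
```

The list passed to ascending must have one entry per sort column, in the same order.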

*Subsetting Data –*

Columns: Use DataFrame[column_name] for a single column or DataFrame[[col1, col2]] for multiple columns.

Rows: Use logical conditions inside square brackets to filter rows.

Text-based filtering: Use == to match text values (e.g., selecting only Labradors).

Date-based filtering: Use conditions like DataFrame[DataFrame[date_column] < 'YYYY-MM-DD'] to filter by date.

Multiple conditions: Use logical operators (& for AND, | for OR) with parentheses for combining multiple filters.

Using .isin(): This method filters rows where a column matches any value in a given list (e.g., selecting dogs that are black or brown).
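
The row filters above can be sketched on a small, invented dogs table. The column names and values are assumptions, and the date column is parsed with pd.to_datetime so that the comparison is chronological.

```python
import pandas as pd

# Hypothetical data for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Labrador", "Schnauzer"],
    "color": ["Brown", "Black", "Tan", "Gray"],
    "date_of_birth": pd.to_datetime(
        ["2013-07-01", "2016-09-16", "2016-08-25", "2011-12-11"]
    ),
})

# Text match
labradors = dogs[dogs["breed"] == "Labrador"]

# Date comparison against an ISO-formatted date string
born_before_2015 = dogs[dogs["date_of_birth"] < "2015-01-01"]

# Two conditions combined with &; each condition needs its own parentheses
old_labradors = dogs[
    (dogs["breed"] == "Labrador") & (dogs["date_of_birth"] < "2015-01-01")
]

# Membership in a list of allowed values
black_or_brown = dogs[dogs["color"].isin(["Black", "Brown"])]

print(old_labradors)
print(black_or_brown)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.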

Practice – Encouragement to apply sorting and subsetting techniques in code.

In [36]:
#Sorting 
print(netflix.sort_values("duration"))

#Sorting by multiple columns
print(netflix.sort_values(["duration", "date_added"]))

#sorting with order
print(netflix.sort_values(["duration", "date_added"], ascending=[True, False]))



     show_id     type                                 title      director  \
8216   s8217  TV Show                        The Bomb Squad           NaN   
5392   s5393  TV Show         Barbie Life in the Dreamhouse           NaN   
3794   s3795  TV Show                     Historical Roasts           NaN   
1593   s1594  TV Show                      Kings of Jo'Burg           NaN   
5393   s5394  TV Show                              Breakout           NaN   
...      ...      ...                                   ...           ...   
5976   s5977    Movie           Çok Filim Hareketler Bunlar  Ozan Açıktan   
4401   s4402    Movie         All's Well, End's Well (2009)   Vincent Kok   
5541   s5542    Movie                       Louis C.K. 2017    Louis C.K.   
5794   s5795    Movie                 Louis C.K.: Hilarious    Louis C.K.   
5813   s5814    Movie  Louis C.K.: Live at the Comedy Store    Louis C.K.   

                                                   cast  \
8216            

In [40]:
# Subsetting a single column

duration_subset = netflix["duration"]
print(duration_subset.head())


0       90 min
1    2 Seasons
2     1 Season
3     1 Season
4    2 Seasons
Name: duration, dtype: object


In [41]:
# Subsetting multiple columns

duration_year = netflix[["duration", "release_year"]]
print(duration_year.head())

    duration  release_year
0     90 min          2020
1  2 Seasons          2021
2   1 Season          2021
3   1 Season          2021
4  2 Seasons          2021


In [45]:
#subsetting rows

yr = netflix[netflix["release_year"] == 2020]
print(yr.head())


   show_id     type                                              title  \
0       s1    Movie                               Dick Johnson Is Dead   
16     s17    Movie  Europe's Most Dangerous Man: Otto Skorzeny in ...   
17     s18  TV Show                                    Falsa identidad   
32     s33  TV Show                                      Sex Education   
34     s35  TV Show                            Tayo and Little Wizards   

                                         director  \
0                                 Kirsten Johnson   
16  Pedro de Echave García, Pablo Azorín Williams   
17                                            NaN   
32                                            NaN   
34                                            NaN   

                                                 cast         country  \
0                                                 NaN   United States   
16                                                NaN             NaN   
17  Luis Ernesto 

In [50]:
# Subsetting rows on two conditions

combo = netflix[(netflix["release_year"] == 2020) & (netflix["rating"] == "PG-13")]
print(combo)


     show_id   type                                            title  \
0         s1  Movie                             Dick Johnson Is Dead   
774     s775  Movie                                         2 Hearts   
1366   s1367  Movie                                           Fatima   
1484   s1485  Movie                                 Cops and Robbers   
1499   s1500  Movie                                 The Midnight Sky   
1550   s1551  Movie                           A California Christmas   
1558   s1559  Movie                                     Giving Voice   
1560   s1561  Movie                                         The Prom   
1705   s1706  Movie                                   The Life Ahead   
1814   s1815  Movie                                          Rebecca   
1816   s1817  Movie                         Tremors: Shrieker Island   
1879   s1880  Movie                                  Hubie Halloween   
1900   s1901  Movie                           Vampires vs. the B

In [53]:
#Subsetting rows by categorical variables

# Ratings to keep
rate = ["PG-13", "TV-MA"]

# Filter for rows 
both_rate = netflix[netflix["rating"].isin(rate)]

# See the result
print(both_rate.head())

  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                cast        country  \
0                                                NaN  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN   
3                                                NaN            NaN   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

           date_added  release_year rating   duration  \
0  September 25, 2021          2020  PG-13     90 min   
1  September 24, 2021          2021  TV-MA  2 Seasons   
2  September 24, 2021        

# Aggregating DataFrames

# Summary Statistics in pandas
1. Introduction to Summary Statistics
Summary statistics help describe and understand your dataset using numerical values.
Common metrics include mean, median, minimum, maximum, variance, and standard deviation.
2. Summarizing Numerical Data
Mean: Calculates the average value.
netflix["release_year"].mean()
Other metrics:
median() – Middle value
min() / max() – Smallest / largest value
var() – Variance
std() – Standard deviation
sum() – Total sum of values
quantile(0.25) – 25th percentile (or any given percentile)
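
The sketch below runs each of these methods on one small numeric column. The column name score and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical numeric column
titles = pd.DataFrame({"score": [6.0, 7.0, 8.0, 9.0]})

median_score = titles["score"].median()
lowest, highest = titles["score"].min(), titles["score"].max()
variance = titles["score"].var()        # sample variance (divides by n - 1)
std_dev = titles["score"].std()         # square root of the sample variance
total = titles["score"].sum()
q25 = titles["score"].quantile(0.25)    # linear interpolation by default

print(median_score, lowest, highest, variance, std_dev, total, q25)
```

Note that var() and std() use the sample formulas by default, so they divide by n - 1 rather than n.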
3. Summarizing Dates
Useful for understanding time-based data trends.
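# date_added is stored as text, so parse it into datetimes before summarizing
date_added = pd.to_datetime(netflix["date_added"].str.strip(), format="%B %d, %Y", errors="coerce")
date_added.min()  # Earliest date a title was added
date_added.max()  # Most recent date a title was added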
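
Summaries such as min and max are only chronological when the column holds real datetimes. The sketch below uses three made-up date strings in the "Month day, year" style to show how text and parsed dates can disagree.

```python
import pandas as pd

# Hypothetical date strings in "Month day, year" form
date_text = pd.Series(["September 25, 2021", "April 4, 2019", "January 1, 2008"])

# On text, min() is alphabetical rather than chronological
text_min = date_text.min()

# After parsing, min() and max() follow the calendar
parsed = pd.to_datetime(date_text, format="%B %d, %Y")
earliest = parsed.min()
latest = parsed.max()

print(text_min)
print(earliest, latest)
```

Here the alphabetical minimum is the April 2019 string, while the true earliest date is in January 2008.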
4. Custom Summaries with .agg() Method
Allows applying custom or multiple statistics on one or more columns.
Example: Finding the 30th percentile of release years

# Custom function for percentile
def pct30(column):
    return column.quantile(0.30)

# Apply custom summary
netflix["release_year"].agg(pct30)
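Multiple columns:

# duration is text such as "90 min", so extract movie runtimes as numbers first
movies = netflix[netflix["type"] == "Movie"].copy()
movies["minutes"] = movies["duration"].str.extract(r"(\d+)", expand=False).astype(float)
movies[["release_year", "minutes"]].agg(pct30)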
Multiple summaries:

# 30th and 40th percentiles
def pct40(column):
    return column.quantile(0.40)

netflix["release_year"].agg([pct30, pct40])
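
To see what .agg() returns in each of the three cases, here is a self-contained sketch on a small made-up table. The column names and values are assumptions, chosen so the percentiles land on round numbers.

```python
import pandas as pd

def pct30(column):
    return column.quantile(0.30)

def pct40(column):
    return column.quantile(0.40)

# Hypothetical data: 11 evenly spaced values per column
dogs = pd.DataFrame({
    "height_cm": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
    "weight_kg": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
})

one_column = dogs["height_cm"].agg(pct30)                   # single number
two_columns = dogs[["height_cm", "weight_kg"]].agg(pct30)   # Series, one entry per column
two_summaries = dogs.agg([pct30, pct40])                    # DataFrame, one row per function

print(one_column)
print(two_columns)
print(two_summaries)
```

When a list of functions is passed, the function names become the row labels of the result.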
5. Cumulative Statistics
Useful for understanding how a metric evolves across rows.
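Example: Cumulative total runtime of movies, in minutes

# duration is text such as "90 min", so extract the number before accumulating
movies = netflix[netflix["type"] == "Movie"].copy()
movies["cumulative_duration"] = movies["duration"].str.extract(r"(\d+)", expand=False).astype(float).cumsum()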
Other cumulative functions:

cummax() – Cumulative maximum
cummin() – Cumulative minimum
cumprod() – Cumulative product
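
The four cumulative methods can be compared side by side on a short numeric column. The weekly sales figures below are invented for illustration.

```python
import pandas as pd

# Hypothetical weekly sales figures
sales = pd.DataFrame({"week": [1, 2, 3, 4], "weekly_sales": [100, 250, 50, 300]})

sales["running_total"] = sales["weekly_sales"].cumsum()
sales["running_max"] = sales["weekly_sales"].cummax()
sales["running_min"] = sales["weekly_sales"].cummin()
sales["running_product"] = sales["weekly_sales"].cumprod()

print(sales)
```

Each method returns a column of the same length as its input, so the results can be stored as new columns next to the original values.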
6. Walmart Dataset Example
Analyze weekly sales, including store-specific data like:
store_id: Unique store identifier
weekly_sales: Total sales for the week
holiday: Boolean indicating if the week had a holiday
fuel_price: Average fuel price during the week
unemployment: National unemployment rate that week
Example: Total sales and mean unemployment rate

# Total sales
walmart["weekly_sales"].sum()

# Mean unemployment rate
walmart["unemployment"].mean()
7. Practice Time!
Try calculating summary statistics on the Netflix dataset or the Walmart data to get hands-on experience! 🚀
