## What is pandas?
Pandas is an open-source Python library used for data manipulation and analysis. It provides data structures and functions designed to make working with structured data fast, easy, and expressive. Pandas is widely used in data science, machine learning, and other fields where data analysis is crucial.

### Main Features of pandas:

DataFrame: The primary data structure in pandas is the DataFrame, which is a two-dimensional labeled data structure with columns of potentially different types. It's similar to a spreadsheet or SQL table, where data is organized into rows and columns.

Series: Pandas also provides the Series data structure, which is a one-dimensional labeled array capable of holding any data type. A DataFrame essentially consists of one or more Series.

Data Input/Output: Pandas provides functions to read data from various file formats, including CSV, Excel, JSON, SQL databases, and more. It also supports writing data to these formats.

Indexing and Selection: Pandas makes it easy to select rows and columns by label, by position, or by condition.

Data Cleaning and Preparation: Pandas offers tools for handling missing data, converting data types, removing duplicates, and other data cleaning tasks. It also provides functions for reshaping and transforming data.

Grouping and Aggregation: Pandas supports grouping data by one or more keys and performing aggregation operations such as sum, mean, count, etc., on the grouped data.

Time Series Analysis: Pandas has extensive support for working with time series data, including date/time indexing, resampling, and time zone handling.

Data Visualization: Pandas offers basic plotting through the plot() methods of Series and DataFrame, which are built on Matplotlib, and it integrates well with libraries such as Seaborn for more elaborate plots and charts.

In [None]:
import pandas as pd

### Series

A pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.).

In [None]:
marvel_actors_list = ['Robert Downey Jr.',
                      'Chris Evans',
                      'Scarlett Johansson',
                      'Tom Holland',
                      'Chris Pratt',
                      'Mark Ruffalo',
                      'Brie Larson',
                      'Zoe Saldana',
                      'Paul Rudd',
                      'Josh Brolin',
                      'Tom Hiddleston',
                      'Anthony Mackie',
                      'Chris Hemsworth',
                      'Benedict Cumberbatch',
                      'Jeremy Renner',
                      'Chadwick Boseman',
                      'Karen Gillan',
                      'Elizabeth Olsen',
                      'Dave Bautista',
                      'Chris Hemsworth']

In [None]:
series_data = pd.Series(marvel_actors_list)
series_data

0        Robert Downey Jr.
1              Chris Evans
2       Scarlett Johansson
3              Tom Holland
4              Chris Pratt
5             Mark Ruffalo
6              Brie Larson
7              Zoe Saldana
8                Paul Rudd
9              Josh Brolin
10          Tom Hiddleston
11          Anthony Mackie
12         Chris Hemsworth
13    Benedict Cumberbatch
14           Jeremy Renner
15        Chadwick Boseman
16            Karen Gillan
17         Elizabeth Olsen
18           Dave Bautista
19         Chris Hemsworth
dtype: object
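
The Series above uses the default integer index. As a minimal sketch of what "labeled" means, the example below builds a Series with custom labels; the names and counts are hypothetical values chosen for illustration, not part of the dataset above.

```python
import pandas as pd

# Hypothetical data: a count per label (values are illustrative only).
movie_counts = pd.Series([9, 7, 5], index=['Robert', 'Chris', 'Tom'], name='movies')

print(movie_counts['Chris'])        # look up by label -> 7
print(movie_counts.iloc[0])         # look up by position -> 9
print(movie_counts.index.tolist())  # the labels themselves
```

The same values can be reached by label or by position, which is the distinction that loc and iloc formalize later on.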

### Data types in Pandas

int64: Integer values. It's a 64-bit integer which allows for large numbers.

float64: Floating-point numbers. It's a 64-bit floating-point number which allows for decimal values.

object: Represents strings or mixed data types. It's a catch-all for columns that hold mixed types or that cannot be represented as any other dtype.

bool: Boolean values, either True or False.

datetime64: Represents date and time data.

timedelta64: Represents the difference between two datetime values.

category: Represents categorical data. It's useful for columns with a limited number of unique values.
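
To see these dtypes side by side, here is a small sketch that builds one column of each kind and prints df.dtypes. The column names and values are made up for illustration, and the exact label printed for string and datetime columns may differ between pandas versions.

```python
import pandas as pd

# Hypothetical sample with one column per dtype discussed above.
sample = pd.DataFrame({
    'count': [1, 2, 3],                      # integers
    'rating': [7.5, 8.0, 9.1],               # floating-point numbers
    'name': ['a', 'b', 'c'],                 # strings
    'is_lead': [True, False, True],          # booleans
    'released': pd.to_datetime(['2019-04-26', '2018-04-27', '2017-11-03']),
})

# Subtracting two datetime columns/values yields a timedelta column.
sample['gap'] = sample['released'] - sample['released'].min()

# A column with few distinct values can be stored as a category.
sample['genre'] = pd.Series(['action', 'action', 'comedy'], dtype='category')

print(sample.dtypes)
```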

### DataFrame

A DataFrame in pandas is a two-dimensional labeled data structure capable of holding data of different types. It is similar to a spreadsheet or SQL table, where data is organized into rows and columns.

In a DataFrame:

Rows are labeled with an index, which can be either integers or strings.

Columns are labeled with column names, which are typically strings.

Each column can contain different types of data, such as integers, floats, strings, or even Python objects.

DataFrames can be created from various data sources, such as lists, dictionaries, CSV files, Excel files, SQL databases, and more.

In [None]:
marvel_actors = {
    'first_name': ['Robert', 'Chris', 'Scarlett', 'Tom', 'Chris', 'Mark', 'Brie', 'Zoe', 'Paul', 'Josh', 'Tom', 'Anthony', 'Chris', 'Benedict', 'Jeremy', 'Chadwick', 'Karen', 'Elizabeth', 'Dave', 'Chris'],
    'last_name': ['Downey Jr.', 'Evans', 'Johansson', 'Holland', 'Pratt', 'Ruffalo', 'Larson', 'Saldana', 'Rudd', 'Brolin', 'Hiddleston', 'Mackie', 'Hemsworth', 'Cumberbatch', 'Renner', 'Boseman', 'Gillan', 'Olsen', 'Bautista', 'Hemsworth'],
    'email': ['robert.downey@example.com', 'chris.evans@example.com', 'scarlett.johansson@example.com', 'tom.holland@example.com', 'chris.pratt@example.com', 'mark.ruffalo@example.com', 'brie.larson@example.com', 'zoe.saldana@example.com', 'paul.rudd@example.com', 'josh.brolin@example.com', 'tom.hiddleston@example.com', 'anthony.mackie@example.com', 'chris.hemsworth@example.com', 'benedict.cumberbatch@example.com', 'jeremy.renner@example.com', 'chadwick.boseman@example.com', 'karen.gillan@example.com', 'elizabeth.olsen@example.com', 'dave.bautista@example.com', 'chris.hemsworth@example.com'],
    'age': [56, 40, 37, 25, 42, 54, 32, 43, 52, 53, 40, 43, 38, 45, 50, 43, 33, 33, 53, 38],
    'number_of_marvel_movies': [9, 7, 7, 5, 5, 6, 4, 6, 4, 4, 4, 4, 4, 4, 6, 4, 4, 4, 3, 3],
    'recent_movie': ['Avengers: Endgame', 'Captain America: Civil War', 'Black Widow', 'Spider-Man: Far From Home', 'Guardians of the Galaxy Vol. 3', 'Thor: Ragnarok', 'Captain Marvel', 'Guardians of the Galaxy Vol. 3', 'Ant-Man and the Wasp', 'Avengers: Infinity War', 'Thor: Ragnarok', 'Avengers: Infinity War', 'Thor: Love and Thunder', 'Doctor Strange in the Multiverse of Madness', 'Avengers: Endgame', 'Black Panther', 'Avengers: Endgame', 'Avengers: Age of Ultron', 'Guardians of the Galaxy Vol. 2', 'Thor: Ragnarok'],
    'recent_movie_rating': [9.0, 8.5, 7.8, 8.2, 8.6, 8.4, 7.9, 8.1, 7.7, 8.9, 8.3, 8.8, 8.5, 7.6, 9.1, 8.7, 8.0, 7.5, 7.9, 8.4],
    'average_rating_of_all_movies': [8.4, 8.2, 7.9, 7.5, 8.0, 7.8, 8.0, 7.6, 7.9, 8.1, 7.7, 8.3, 7.9, 8.2, 8.0, 8.5, 7.6, 7.8, 7.7, 8.2],
    'net_worth': ['$300 million', '$80 million', '$165 million', '$15 million', '$60 million', '$35 million', '$25 million', '$35 million', '$70 million', '$35 million', '$25 million', '$20 million', '$130 million', '$30 million', '$60 million', '$40 million', '$7 million', '$11 million', '$20 million', '$90 million']
}


In [None]:
marvel_actors_df = pd.DataFrame(marvel_actors)
marvel_actors_df

,first_name,last_name,email,age,number_of_marvel_movies,recent_movie,recent_movie_rating,average_rating_of_all_movies,net_worth
0,Robert,Downey Jr.,robert.downey@example.com,56,9,Avengers: Endgame,9.0,8.4,$300 million
1,Chris,Evans,chris.evans@example.com,40,7,Captain America: Civil War,8.5,8.2,$80 million
2,Scarlett,Johansson,scarlett.johansson@example.com,37,7,Black Widow,7.8,7.9,$165 million
3,Tom,Holland,tom.holland@example.com,25,5,Spider-Man: Far From Home,8.2,7.5,$15 million
4,Chris,Pratt,chris.pratt@example.com,42,5,Guardians of the Galaxy Vol. 3,8.6,8.0,$60 million
5,Mark,Ruffalo,mark.ruffalo@example.com,54,6,Thor: Ragnarok,8.4,7.8,$35 million
6,Brie,Larson,brie.larson@example.com,32,4,Captain Marvel,7.9,8.0,$25 million
7,Zoe,Saldana,zoe.saldana@example.com,43,6,Guardians of the Galaxy Vol. 3,8.1,7.6,$35 million
8,Paul,Rudd,paul.rudd@example.com,52,4,Ant-Man and the Wasp,7.7,7.9,$70 million
9,Josh,Brolin,josh.brolin@example.com,53,4,Avengers: Infinity War,8.9,8.1,$35 million


### Series vs DataFrames

The main difference between a Series and a DataFrame in pandas lies in their dimensions and structure:

Dimensionality:

Series: A Series is a one-dimensional labeled array capable of holding data of any type (integers, floats, strings, etc.). It is essentially a single column of data with an associated index.
DataFrame: A DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or SQL table. It consists of rows and columns, where each column can have a different data type.


Structure:

Series: A Series has a single column of data and an index. It is like a specialized dictionary or NumPy array.
DataFrame: A DataFrame has multiple columns of data, each with a unique column name. It is a tabular data structure, where each column can be of a different data type.


Use Cases:

Series: Series are typically used for storing one-dimensional data, such as time series data, sensor readings, or single variables.
DataFrame: DataFrames are used for storing and working with two-dimensional data, such as structured datasets with multiple variables or attributes.
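
The relationship between the two structures shows up when selecting columns. In this minimal sketch (the tiny DataFrame is hypothetical), single brackets return a one-dimensional Series, while a list of column names returns a two-dimensional DataFrame.

```python
import pandas as pd

# Hypothetical two-row DataFrame used only for this illustration.
df = pd.DataFrame({'first_name': ['Tom', 'Zoe'], 'age': [25, 43]})

one_column = df['age']     # single brackets -> Series
sub_frame = df[['age']]    # list of column names -> DataFrame

print(type(one_column).__name__, one_column.shape)  # Series (2,)
print(type(sub_frame).__name__, sub_frame.shape)    # DataFrame (2, 1)
```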

### loc vs iloc

loc and iloc are both indexer attributes in pandas used for indexing and selection, but they have different behaviors and use cases:

loc (label-based indexing):

loc selects data by index labels and column names.
When slicing with loc, both the start and stop labels are inclusive.

Syntax: dataframe.loc[row_label, column_label]

Example (accessing a single element by label): value = dataframe.loc['row_label', 'column_label']

iloc (integer-based indexing):

iloc selects data by integer row and column positions.
When slicing with iloc, the stop position is exclusive, following Python's standard slicing behavior.

Syntax: dataframe.iloc[row_position, column_position]

Example (accessing a single element by position): value = dataframe.iloc[row_position, column_position]

Differences:

Indexing method: loc uses labels to index data, while iloc uses integer positions.

Inclusivity: a loc slice includes both its start and stop labels, while an iloc slice excludes its stop position, consistent with Python's standard slicing behavior.

Usage: use loc to access data by index labels or column names, and iloc to access data by integer positions.

In [None]:
import pandas as pd

# Creating a DataFrame
data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]}
df = pd.DataFrame(data, index=['X', 'Y', 'Z'])

# Using loc
print(df.loc['Y', 'B'])  # Output: 5

# Using iloc
print(df.iloc[1, 1])  # Output: 5


5
5
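
The single-element lookups above do not show the inclusivity difference, which only appears when slicing. The following sketch reuses the same small DataFrame to compare a label slice with a position slice.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]},
                  index=['X', 'Y', 'Z'])

# loc slices by label and includes both endpoints: rows X..Y, columns A..B.
by_label = df.loc['X':'Y', 'A':'B']

# iloc slices by position and excludes the stop: only row 0 and column 0.
by_position = df.iloc[0:1, 0:1]

print(by_label.shape)     # (2, 2)
print(by_position.shape)  # (1, 1)
```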


In [None]:
marvel_actors_df.index

RangeIndex(start=0, stop=20, step=1)

In [None]:
marvel_actors_df.columns

Index(['first_name', 'last_name', 'email', 'age', 'number_of_marvel_movies',
       'recent_movie', 'recent_movie_rating', 'average_rating_of_all_movies',
       'net_worth'],
      dtype='object')

In [None]:
# setting index
marvel_actors_df.set_index("email", inplace=True)

In [None]:
marvel_actors_df.head()

email,first_name,last_name,age,number_of_marvel_movies,recent_movie,recent_movie_rating,average_rating_of_all_movies,net_worth
robert.downey@example.com,Robert,Downey Jr.,56,9,Avengers: Endgame,9.0,8.4,$300 million
chris.evans@example.com,Chris,Evans,40,7,Captain America: Civil War,8.5,8.2,$80 million
scarlett.johansson@example.com,Scarlett,Johansson,37,7,Black Widow,7.8,7.9,$165 million
tom.holland@example.com,Tom,Holland,25,5,Spider-Man: Far From Home,8.2,7.5,$15 million
chris.pratt@example.com,Chris,Pratt,42,5,Guardians of the Galaxy Vol. 3,8.6,8.0,$60 million
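
One reason to use a meaningful column as the index is that rows can then be looked up directly by label with loc. A minimal sketch, using a hypothetical two-row DataFrame and the non-inplace form of set_index and reset_index (both return a new DataFrame):

```python
import pandas as pd

# Hypothetical data for illustration only.
people = pd.DataFrame({
    'email': ['tom@example.com', 'zoe@example.com'],
    'first_name': ['Tom', 'Zoe'],
    'age': [25, 43],
})

indexed = people.set_index('email')             # email becomes the row labels
print(indexed.loc['zoe@example.com', 'age'])    # 43

restored = indexed.reset_index()                # email moves back to a column
print(restored.columns.tolist())
```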


In [None]:
# resetting index
marvel_actors_df.reset_index(inplace=True)

In [None]:
marvel_actors_df.head()

,email,first_name,last_name,age,number_of_marvel_movies,recent_movie,recent_movie_rating,average_rating_of_all_movies,net_worth
0,robert.downey@example.com,Robert,Downey Jr.,56,9,Avengers: Endgame,9.0,8.4,$300 million
1,chris.evans@example.com,Chris,Evans,40,7,Captain America: Civil War,8.5,8.2,$80 million
2,scarlett.johansson@example.com,Scarlett,Johansson,37,7,Black Widow,7.8,7.9,$165 million
3,tom.holland@example.com,Tom,Holland,25,5,Spider-Man: Far From Home,8.2,7.5,$15 million
4,chris.pratt@example.com,Chris,Pratt,42,5,Guardians of the Galaxy Vol. 3,8.6,8.0,$60 million


In [None]:
# how to move columns
# pop removes the specified column from the DataFrame in place and returns it as a Series
email_col = marvel_actors_df.pop('email')

In [None]:
# insert(loc, column, value): position to insert at, column name, and the Series to insert
marvel_actors_df.insert(2, 'email', email_col)

In [None]:
marvel_actors_df.head()

,first_name,last_name,email,age,number_of_marvel_movies,recent_movie,recent_movie_rating,average_rating_of_all_movies,net_worth
0,Robert,Downey Jr.,robert.downey@example.com,56,9,Avengers: Endgame,9.0,8.4,$300 million
1,Chris,Evans,chris.evans@example.com,40,7,Captain America: Civil War,8.5,8.2,$80 million
2,Scarlett,Johansson,scarlett.johansson@example.com,37,7,Black Widow,7.8,7.9,$165 million
3,Tom,Holland,tom.holland@example.com,25,5,Spider-Man: Far From Home,8.2,7.5,$15 million
4,Chris,Pratt,chris.pratt@example.com,42,5,Guardians of the Galaxy Vol. 3,8.6,8.0,$60 million


In [None]:
# filtering: & binds more tightly than |, so the age and movie-count conditions are grouped explicitly
filt = (
    (marvel_actors_df['first_name'] == 'Tom')
    | (marvel_actors_df['last_name'] == 'Hemsworth')
    | ((marvel_actors_df['age'] > 50) & (marvel_actors_df['number_of_marvel_movies'] > 5))
)
filt

0      True
1     False
2     False
3      True
4     False
5      True
6     False
7     False
8     False
9     False
10     True
11    False
12     True
13    False
14    False
15    False
16    False
17    False
18    False
19     True
dtype: bool

In [None]:
marvel_actors_df.loc[filt]

,first_name,last_name,email,age,number_of_marvel_movies,recent_movie,recent_movie_rating,average_rating_of_all_movies,net_worth
0,Robert,Downey Jr.,robert.downey@example.com,56,9,Avengers: Endgame,9.0,8.4,$300 million
3,Tom,Holland,tom.holland@example.com,25,5,Spider-Man: Far From Home,8.2,7.5,$15 million
5,Mark,Ruffalo,mark.ruffalo@example.com,54,6,Thor: Ragnarok,8.4,7.8,$35 million
10,Tom,Hiddleston,tom.hiddleston@example.com,40,4,Thor: Ragnarok,8.3,7.7,$25 million
12,Chris,Hemsworth,chris.hemsworth@example.com,38,4,Thor: Love and Thunder,8.5,7.9,$130 million
19,Chris,Hemsworth,chris.hemsworth@example.com,38,3,Thor: Ragnarok,8.4,8.2,$90 million
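
Boolean masks like the one above can also be built with isin and inverted with ~. The sketch below uses a small hypothetical DataFrame rather than the full Marvel data.

```python
import pandas as pd

# Hypothetical data for illustration only.
actors = pd.DataFrame({
    'first_name': ['Tom', 'Chris', 'Mark', 'Zoe'],
    'age': [25, 40, 54, 43],
})

# isin tests membership in a list of values.
name_mask = actors['first_name'].isin(['Tom', 'Zoe'])
print(actors.loc[name_mask, 'first_name'].tolist())   # ['Tom', 'Zoe']

# ~ negates a mask; combine it with another condition using &.
older_others = actors.loc[~name_mask & (actors['age'] > 50), 'first_name']
print(older_others.tolist())                          # ['Mark']
```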


In [None]:
# updating rows and columns
"""
1. Update all column names by assigning a new list of names to df.columns
2. We can also use a list comprehension to build the new column names
3. We can also use str methods on df.columns to alter the column names
4. We can rename specific columns with the df.rename method
5. To update a specific row, select it with an indexer and assign the new values to it
6. To update specific columns of a row, select both the rows and the columns with an indexer
7. To update rows based on a condition, select them with a filter and assign the new values
8. For function-based or element-wise updates use apply, map, applymap (renamed to DataFrame.map in pandas 2.1), and replace
"""

'\n1. Update all column names by assigning a new list of names to df.columns\n2. We can also use a list comprehension to build the new column names\n3. We can also use str methods on df.columns to alter the column names\n4. We can rename specific columns with the df.rename method\n5. To update a specific row, select it with an indexer and assign the new values to it\n6. To update specific columns of a row, select both the rows and the columns with an indexer\n7. To update rows based on a condition, select them with a filter and assign the new values\n8. For function-based or element-wise updates use apply, map, applymap (renamed to DataFrame.map in pandas 2.1), and replace\n'
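
The notes above can be sketched on a small example. The DataFrame, its column names, and the replacement values below are hypothetical and chosen only to exercise each technique once.

```python
import pandas as pd

# Hypothetical data with untidy column names.
df = pd.DataFrame({
    'First Name': ['Tom', 'Zoe'],
    'Last Name': ['Holland', 'Saldana'],
    'age': [25, 43],
})

# Notes 1-3: rewrite every column name using str methods on df.columns.
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Note 4: rename one specific column.
df = df.rename(columns={'age': 'age_years'})

# Notes 5-7: select cells with a filter plus a column label, then assign.
df.loc[df['first_name'] == 'Tom', 'age_years'] = 26

# Note 8: map applies a function to each element of a Series,
# and replace swaps specific values.
df['last_name'] = df['last_name'].map(str.upper)
df['first_name'] = df['first_name'].replace({'Tom': 'Thomas'})

print(df)
```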

In [None]:
# adding and removing rows and columns
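
As a minimal sketch of adding and removing rows and columns, one possible approach uses column assignment, drop, and pd.concat. The small DataFrame and its values are hypothetical, not the Marvel data above.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'first_name': ['Tom', 'Zoe'],
    'last_name': ['Holland', 'Saldana'],
})

# Add a column by combining existing ones.
df['full_name'] = df['first_name'] + ' ' + df['last_name']

# Remove columns; drop returns a new DataFrame by default.
df = df.drop(columns=['first_name', 'last_name'])

# Add rows by concatenating another DataFrame with matching columns.
extra = pd.DataFrame({'full_name': ['Paul Rudd']})
df = pd.concat([df, extra], ignore_index=True)

# Remove a row by its index label.
df = df.drop(index=0)

print(df)
```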


In [None]:
# Sorting DataFrames based on columns
"""
1. To sort a DataFrame, use the df.sort_values() method
2. To sort by one or more columns, use the 'by' parameter
3. To sort in ascending or descending order, use the 'ascending' parameter
4. We can sort by multiple columns, each in a different order, by passing lists to 'by' and 'ascending'
5. To get the n largest or n smallest values, use df.nlargest() or df.nsmallest()
"""

"\n1. To sort a DataFrame, use the df.sort_values() method\n2. To sort by one or more columns, use the 'by' parameter\n3. To sort in ascending or descending order, use the 'ascending' parameter\n4. We can sort by multiple columns, each in a different order, by passing lists to 'by' and 'ascending'\n5. To get the n largest or n smallest values, use df.nlargest() or df.nsmallest()\n"

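The sorting notes can be illustrated with a short sketch. The four-row DataFrame is hypothetical; it sorts by first name ascending and age descending, then picks the oldest and youngest rows.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'first_name': ['Chris', 'Chris', 'Tom', 'Chris'],
    'last_name': ['Evans', 'Pratt', 'Holland', 'Hemsworth'],
    'age': [40, 42, 25, 38],
})

# Sort by several columns, each with its own direction.
by_name_then_age = df.sort_values(by=['first_name', 'age'], ascending=[True, False])
print(by_name_then_age['last_name'].tolist())

# Rows with the largest / smallest values in a column.
oldest_two = df.nlargest(2, 'age')
youngest = df.nsmallest(1, 'age')
print(oldest_two['last_name'].tolist())
print(youngest['last_name'].tolist())
```
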
In [None]:
# Aggregation and Grouping
"""
1. Aggregation functions (sum, min, max, count, mean, median, mode, std)
2. To get summary statistics, use the df.describe() method
3. To count occurrences of each value, use the value_counts() method
4. Grouping: split --> apply function --> combine
5. To group the data by category, use the df.groupby() method and pass the columns to group by
6. Grouping is similar to applying a filter for each category and performing some operation on the filtered data
7. To apply multiple aggregation functions at once, use the agg() method and pass the list of aggregation functions
"""

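A minimal sketch of the split, apply, combine pattern from the notes above. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'first_name': ['Chris', 'Chris', 'Tom', 'Tom', 'Zoe'],
    'movies': [7, 5, 5, 4, 6],
    'rating': [8.5, 8.6, 8.2, 8.3, 8.1],
})

# Summary statistics for one numeric column.
print(df['movies'].describe())

# How often each value occurs.
name_counts = df['first_name'].value_counts()

# Split by first_name, apply sum to each group, combine into one Series.
movies_per_name = df.groupby('first_name')['movies'].sum()

# Several aggregations at once: one output column per function.
rating_stats = df.groupby('first_name')['rating'].agg(['mean', 'max'])

print(movies_per_name)
print(rating_stats)
```
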
In [None]:
# Cleaning Data - Casting Datatypes and Handling Missing Values
"""
1. To drop NA (missing) values from a DataFrame, use the df.dropna() method
2. To drop columns with missing values, set 'axis' to 'columns'
3. To drop rows or columns where all values are missing, set the 'how' parameter to 'all'
4. To drop rows or columns where any value is missing, set the 'how' parameter to 'any' (the default)
5. To check for missing values in only some columns, set 'subset' to the list of columns to check
6. If the DataFrame uses custom markers for missing values, replace them with NA using the df.replace() method (for example with NumPy's np.nan)
7. To get a boolean mask of NA values, use the df.isna() method
8. To fill NA values, use the df.fillna(<value to fill>) method
9. To change the type of a column, use the 'astype' method and pass the datatype
10. To change the datatype of several columns at once, pass 'astype' a dict mapping column names to datatypes
"""
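
The cleaning steps above can be sketched end to end on a tiny example. The DataFrame, the 'Missing' and 'NA' placeholder strings, and the fill values are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ages stored as strings, with custom missing markers.
df = pd.DataFrame({
    'name': ['Tom', 'Zoe', 'Missing', None],
    'age': ['25', 'NA', '31', None],
})

# Note 6: turn custom placeholders into real missing values.
df = df.replace(['Missing', 'NA'], np.nan)

# Note 7: boolean mask of missing cells, summed per column.
mask = df.isna()
print(mask.sum())

# Notes 1-5: different ways to drop rows with missing values.
complete_rows = df.dropna()                 # drop rows with any missing value
not_all_missing = df.dropna(how='all')      # drop rows where everything is missing
has_name = df.dropna(subset=['name'])       # only check the 'name' column

# Notes 8-10: fill the gaps per column, then cast the age column to integers.
filled = df.fillna({'name': 'Unknown', 'age': '0'})
filled = filled.astype({'age': 'int64'})
print(filled)
```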