## Pandas in a Nutshell

Pandas is a Python library for data cleaning, manipulation, and analysis. While NumPy excels with homogeneous numerical arrays, Pandas is tailored to tabular, heterogeneous data. It is built on top of NumPy (which we will learn next week) and supports the same kinds of element-wise operations: addition, subtraction, conditional operations, and broadcasting. Where NumPy's core structure is the multi-dimensional array, Pandas centers on the DataFrame, a two-dimensional labeled table whose columns can hold different data types.

Pandas integrates with libraries such as SciPy for statistical analysis, Matplotlib for data visualization, and Scikit-learn for machine learning applications.

Common Uses of Pandas:
* Data Cleaning: Handling missing values, filtering rows/columns, aggregating, and transforming data.
* Statistical Computation: Calculating metrics like mean, median, max, min, and standard deviation.
* Correlation Analysis: Assessing relationships between data columns.
* Data Distribution Analysis: Understanding the spread and frequency of data points.
* Data Visualization: Creating plots and charts, often in conjunction with Matplotlib.
* Data Exporting: Saving processed data to formats like CSV or databases.
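
A few of these uses can be sketched on a tiny table. The column names and values below are invented for illustration and do not come from the albums dataset used later.

```python
import pandas as pd

# Hypothetical toy data (assumption: not taken from any real dataset)
df = pd.DataFrame({
    "tracks": [10, 12, 14, 16],
    "minutes": [40.0, 48.0, 56.0, 64.0],
    "genre": ["Rock", "Pop", "Rock", "Pop"],
})

mean_tracks = df["tracks"].mean()           # statistical computation
std_tracks = df["tracks"].std()             # sample standard deviation
corr = df["tracks"].corr(df["minutes"])     # correlation between two columns
genre_counts = df["genre"].value_counts()   # frequency of each category

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)

print(mean_tracks)   # 13.0
print(corr)          # approximately 1.0, since minutes = 4 * tracks
print(genre_counts)
```

Passing a file path to `to_csv` would write the data to disk instead of returning a string.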




### Pandas offers two primary data structures:

* Series: A one-dimensional labeled array capable of holding any data type.
* DataFrame: A two-dimensional labeled data structure with columns of potentially different types, akin to a spreadsheet or SQL table.
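
As a minimal sketch of the two structures, the following builds one of each by hand. The labels and values are made up for the example.

```python
import pandas as pd

# A Series: one-dimensional, with a label attached to every value
s = pd.Series([1990, 1991, 1992], index=["a", "b", "c"], name="year")

# A DataFrame: two-dimensional, and each column can have its own data type
df = pd.DataFrame({
    "year": [1990, 1991],
    "artist": ["Artist A", "Artist B"],   # placeholder names
    "hours": [1.5, 0.9],
})

print(s["b"])              # 1991
print(df.shape)            # (2, 3)
print(df.dtypes)           # one dtype per column
print(type(df["year"]))    # selecting a single column gives back a Series
```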

Each DataFrame consists of:
* Row Labels (Index): Identifiers for rows, defaulting to integers starting from zero but customizable.
* Column Labels (Column Names): Identifiers for columns, which can be defined by the user.

Creating a DataFrame from a CSV File:
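
A minimal sketch of loading CSV data with `pd.read_csv`. To keep the example self-contained, an in-memory string stands in for a file on disk; its contents are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (assumption: made-up contents)
csv_text = """Year,Artist,Album
1990,Artist A,Album X
1991,Artist B,Album Y
"""

# read_csv accepts a file path or any file-like object
albums = pd.read_csv(io.StringIO(csv_text))
print(albums.shape)           # (2, 3)
print(list(albums.columns))   # ['Year', 'Artist', 'Album']

# index_col uses a column as the row labels instead of the default 0, 1, 2, ...
by_year = pd.read_csv(io.StringIO(csv_text), index_col="Year")
print(by_year.loc[1991, "Album"])   # Album Y
```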

## Exploring DataFrames:
Once data is loaded into a DataFrame, you can explore and manipulate it using various attributes and methods.

### Attributes:
* dtypes: Returns the data types of each column.
* columns: Lists the column names.
* index: Displays the row index labels.
* shape: Provides the dimensions of the DataFrame (rows, columns).
* size: Gives the total number of elements in the DataFrame.
### Methods:
* head(n): Displays the first n rows.
* tail(n): Shows the last n rows.
* describe(): Generates summary statistics for numerical columns.
* dropna(): Removes rows with missing values.
* fillna(value): Replaces missing values with the specified value.
* sort_values(by): Sorts the DataFrame based on the specified column.
* groupby(by): Groups the DataFrame using a specified column for aggregation.
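
The missing-value, sorting, and grouping methods are easier to follow on a table small enough to check by eye. This is a sketch with invented data, not the albums dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value (assumption: made-up values)
df = pd.DataFrame({
    "artist": ["A", "B", "A", "C"],
    "tracks": [10, np.nan, 14, 12],
})

dropped = df.dropna()                  # drops the row containing NaN
filled = df.fillna({"tracks": 0})      # fills only the 'tracks' column
ordered = df.sort_values(by="tracks", ascending=False)   # NaN rows go last by default
per_artist = df.groupby("artist")["tracks"].mean()       # mean skips NaN values

print(len(dropped))          # 3
print(list(ordered.index))   # [2, 3, 0, 1]
print(per_artist)
```

Passing a dictionary to `fillna` fills each column with its own value, which avoids putting a string placeholder into a numeric column.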


### More Methods for Data Manipulation with Pandas:
* Subsetting Data: Selecting specific rows and columns using loc (label-based indexing, where slices include the end label) and iloc (integer position-based indexing, where slices exclude the end position).
* Sorting Data: Arranging data based on column values using sort_values().
* Ranking Data: Assigning ranks to data points within a column using rank().
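
The difference between `loc` and `iloc` slices, and the way `rank()` treats ties, can be seen in a short sketch. The album names and sales figures are invented.

```python
import pandas as pd

# Toy data (assumption: made-up values)
df = pd.DataFrame({"album": ["W", "X", "Y", "Z"], "sales": [30, 10, 30, 20]})

by_label = df.loc[0:2, ["album"]]   # label slice: end label included -> rows 0, 1, 2
by_position = df.iloc[0:2, [0]]     # position slice: end excluded -> rows 0, 1

# ascending=False gives rank 1 to the largest value;
# tied values share the average of their ranks by default
df["rank"] = df["sales"].rank(ascending=False)
# method="min" gives tied values the lowest rank of the tie instead
df["rank_min"] = df["sales"].rank(ascending=False, method="min")

print(len(by_label), len(by_position))   # 3 2
print(df)
```

Here the two albums with sales of 30 both get rank 1.5 under the default method and rank 1.0 with `method="min"`.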







In [6]:
import pandas as pd
al = pd.read_csv("Top 10 Albums By Year.csv")


print("Data Types:\n", al.dtypes)  # Data types of each column
print("\nColumns:\n", al.columns)  # List of column names
print("\nIndex:\n", al.index)  # Row index labels
print("\nShape (Rows, Columns):", al.shape)  # Dimensions of DataFrame
print("\nSize (Total Elements):", al.size)  # Total number of elements


# Display the first 5 rows
print("\nFirst 5 Rows:\n", al.head())

# Display the last 5 rows
print("\nLast 5 Rows:\n", al.tail())

# Summary statistics for numerical columns
print("\nSummary Statistics:\n", al.describe())

# Remove rows with missing values
al_cleaned = al.dropna()
print("\nData After Dropping NA Rows:\n", al_cleaned.head())

# Fill missing values with a placeholder (e.g., "Unknown")
al_filled = al.fillna("Unknown")
print("\nData After Filling NA Values:\n", al_filled.head())

# Sort by a column (change 'Year' to an actual column name from your dataset)
sorted_al = al.sort_values(by="Year")
print("\nData Sorted by Year:\n", sorted_al.head())

# Group by a column and count occurrences (change 'Artist' to a valid column)
grouped_al = al.groupby("Artist").size()
print("\nGrouped Data by Artist:\n", grouped_al)



# Subsetting data using label-based indexing (first 5 rows and specific columns).
# loc slices include the end label, so :4 selects rows 0 through 4.
subset1 = al.loc[:4, ["Year", "Artist"]]
print("\nSubset Using loc:\n", subset1)

# Subsetting data using integer-based indexing
subset2 = al.iloc[:5, :2]  # First 5 rows, first 2 columns
print("\nSubset Using iloc:\n", subset2)

# Sorting data by Year and Album Name
sorted_albums = al.sort_values(by=["Year", "Album"])
print("\nSorted Albums by Year and Name:\n", sorted_albums.head())

# Assigning ranks to albums based on a numeric 'Sales' column.
# Note: this dataset has no 'Sales' column (its 'Worldwide Sales' column is
# stored as object, not as a number), so the two blocks below are skipped
# unless a numeric 'Sales' column is added first.
if "Sales" in al.columns:
    al["Rank"] = al["Sales"].rank(ascending=False)
    print("\nRanked Albums Based on Sales:\n", al[["Album", "Sales", "Rank"]].head())

# Arithmetic operations (only if the numeric 'Sales' column exists)
if "Sales" in al.columns:
    al["Sales_in_Millions"] = al["Sales"] / 1_000_000  # Convert to millions
    print("\nSales in Millions:\n", al[["Album", "Sales_in_Millions"]].head())


Data Types:
 Year                 int64
Ranking              int64
Artist              object
Album               object
Worldwide Sales     object
CDs                  int64
Tracks               int64
Album Length        object
Hours              float64
Minutes            float64
Seconds              int64
Genre               object
dtype: object

Columns:
 Index(['Year', 'Ranking', 'Artist', 'Album', 'Worldwide Sales', 'CDs',
       'Tracks', 'Album Length', 'Hours', 'Minutes', 'Seconds', 'Genre'],
      dtype='object')

Index:
 RangeIndex(start=0, stop=320, step=1)

Shape (Rows, Columns): (320, 12)

Size (Total Elements): 3840

First 5 Rows:
    Year  Ranking            Artist  \
0  1990        8      Phil Collins   
1  1990        1           Madonna   
2  1990       10  The Three Tenors   
3  1990        4         MC Hammer   
4  1990        6  Movie Soundtrack   

                                        Album Worldwide Sales  CDs  Tracks  \
0                       Serious Hits..

In [8]:
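# Create two DataFrames with the same row and column labels
df1 = pd.DataFrame([(1, 2), (3, 4), (5, 6)], columns=['a', 'b'])
df2 = pd.DataFrame([(100, 200), (300, 400), (500, 600)], columns=['a', 'b'])

# Element-wise addition, matched on row index and column labels
df1 + df2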

     a    b
0  101  202
1  303  404
2  505  606
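
The sum of df1 and df2 is computed element by element because the two frames share the same row and column labels. As a hedged sketch of what happens otherwise, the example below adds a scalar, a Series, and a frame with only partly matching labels; `df3` is a hypothetical frame introduced for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame([(1, 2), (3, 4), (5, 6)], columns=["a", "b"])

# A scalar is broadcast to every element
scaled = df1 * 10

# A Series is matched to the column labels and broadcast down the rows
shifted = df1 + pd.Series({"a": 100, "b": 200})

# Hypothetical frame with one row and a column 'c' that df1 lacks
df3 = pd.DataFrame([(1, 2)], columns=["a", "c"], index=[0])

# Labels present in only one operand produce NaN in the result
partial = df1 + df3

print(scaled)
print(shifted)
print(partial)
```

Only the cell at row 0, column a exists in both df1 and df3, so it is the only non-missing value in `partial`.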
