# Pandas and Reading Data in Python

Pandas is a powerful Python library used for data manipulation and analysis. One of its key functionalities is the ability to read, process, and analyze structured data from various sources, such as CSV and Excel files.

In this class (and probably the next class too), we will explore how to read data using Pandas and apply basic operations to understand our dataset.

# Extracting and Reading Data in Python  

## Types of Data  
Extracting useful information from data starts with reading it into a usable format. Data can generally be classified into two types:  

1. **Structured Data**  
2. **Unstructured Data**  

### Structured Data  
Structured data is organized in a tabular format, where **rows** represent individual **observations** and **columns** represent **variables**. For example, the dataset below contains five observations, each representing a movie. The columns store different attributes such as title, budget, genre, and ratings. Since all attributes in a row relate to the same entity (a movie), this type of dataset is known as **relational data**.  

| Title                           | US Gross   | Production Budget | Release Date  | Major Genre       | Creative Type | Rotten Tomatoes Rating | IMDB Rating |
|---------------------------------|------------|------------------|--------------|------------------|--------------|----------------------|-------------|
| The Shawshank Redemption        | 28,241,469  | 25,000,000       | Sep 23, 1994  | Drama            | Fiction      | 88                   | 9.2         |
| Inception                       | 285,630,280 | 160,000,000      | Jul 16, 2010  | Horror/Thriller  | Fiction      | 87                   | 9.1         |
| One Flew Over the Cuckoo's Nest | 108,981,275 | 4,400,000        | Nov 19, 1975  | Comedy           | Fiction      | 96                   | 8.9         |
| The Dark Knight                 | 533,345,358 | 185,000,000      | Jul 18, 2008  | Action/Adventure | Fiction      | 93                   | 8.9         |
| Schindler's List                | 96,067,179  | 25,000,000       | Dec 15, 1993  | Drama            | Non-Fiction  | 97                   | 8.9         |
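
As a rough illustration of the row/column idea, the sketch below enters a few columns of the table above by hand into a Pandas DataFrame. The variable name `movies` is only an assumption for this example, and the commas have been dropped from the numbers so they are stored as integers.

```python
import pandas as pd

# A few columns of the table above, entered by hand for illustration.
movies = pd.DataFrame({
    "Title": [
        "The Shawshank Redemption",
        "Inception",
        "One Flew Over the Cuckoo's Nest",
        "The Dark Knight",
        "Schindler's List",
    ],
    "US Gross": [28241469, 285630280, 108981275, 533345358, 96067179],
    "Production Budget": [25000000, 160000000, 4400000, 185000000, 25000000],
    "IMDB Rating": [9.2, 9.1, 8.9, 8.9, 8.9],
})

n_rows, n_cols = movies.shape
print(n_rows, n_cols)          # number of observations, number of variables
print(movies.iloc[0])          # one row = one observation (a single movie)
print(movies["IMDB Rating"])   # one column = one variable
```

Each row can be pulled out by position with `iloc`, and each column by its name, which mirrors the observation/variable split described above.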

### Unstructured Data  
Unstructured data does not follow a predefined format or structure. Examples include text files, images, audio and video recordings, and Internet of Things (IoT) data. Since analytical tools are primarily designed for structured data, analyzing unstructured data can be more challenging. However, unstructured data can often be converted into a structured format. For example, an image can be transformed into a matrix of pixel values, allowing machine learning models to classify it as a dog or a cat.
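
To make the image example a little more concrete, here is a minimal sketch that assumes a made-up 3x3 grayscale image. Real images are far larger and are usually loaded with an imaging library, which is not shown here. The names `image`, `pixels`, and `image_row` are illustrative only.

```python
import numpy as np
import pandas as pd

# A made-up 3x3 grayscale "image": 0 is black, 255 is white.
image = np.array([
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 16],
], dtype=np.uint8)

# Flatten the grid into a single row with one column per pixel,
# so the image becomes one observation in a table.
pixels = image.reshape(1, -1)
columns = [f"pixel_{i}" for i in range(pixels.shape[1])]
image_row = pd.DataFrame(pixels, columns=columns)
print(image_row)
```

Stacking one such row per image would give a structured table that a model could be trained on.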

---

## Reading CSV Files with Pandas  
Structured data can be stored in different formats, but **CSV (Comma-Separated Values)** is one of the most common. In a CSV file, the values in each row are separated by commas. The commas are visible when the file is opened in a plain text editor, but a spreadsheet application like Microsoft Excel hides them and shows each value in its own cell.
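
The sketch below shows what the raw text of a small CSV file might look like and how Pandas splits it into columns. The content is hypothetical, and `io.StringIO` is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, as it would appear in a plain text editor.
raw_text = (
    "Title,Year,Rating\n"
    "Movie A,2014,8.1\n"
    "Movie B,2012,7.0\n"
)

# io.StringIO lets read_csv treat the string like an open file.
sample = pd.read_csv(io.StringIO(raw_text))
print(sample)
```

The first line of the text becomes the column names, and each following line becomes one row.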

To read a CSV file in Python using the Pandas library:

In [9]:
# Import the pandas library
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv("Movie-Data.csv")

# Display the first few rows
print(df.head())

   Rank                    Title                     Genre  \
0     1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1     2               Prometheus  Adventure,Mystery,Sci-Fi   
2     3                    Split           Horror,Thriller   
3     4                     Sing   Animation,Comedy,Family   
4     5            Suicide Squad  Action,Adventure,Fantasy   

                                         Description              Director  \
0  A group of intergalactic criminals are forced ...            James Gunn   
1  Following clues to the origin of mankind, a te...          Ridley Scott   
2  Three girls are kidnapped by a man with a diag...    M. Night Shyamalan   
3  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet   
4  A secret government agency recruits some of th...            David Ayer   

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121

## Exploring the Data

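Once you've read the data into a DataFrame, you can explore it using Pandas methods and attributes:

- `df.head()` – View the first five rows of the dataset (pass a number, e.g. `df.head(10)`, to see more).
- `df.tail()` – View the last five rows.
- `df.shape` – Get the number of rows and columns as a `(rows, columns)` tuple. This is an attribute, so it takes no parentheses.
- `df.info()` – Print the column names, non-null counts, data types, and memory usage.
- `df.describe()` – Get a statistical summary (count, mean, standard deviation, min, quartiles, max) of the numerical columns.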


In [None]:
# Display column names, non-null counts, and data types
# (info() prints its report directly and returns None, so no print() is needed)
df.info()

# Summary statistics
print(df.describe())
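
For a version that runs without `Movie-Data.csv`, the following sketch applies the same calls to a small, hypothetical DataFrame. The name `scores` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: six rows, so head() and tail() return different slices.
scores = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "score": [10, 20, 30, 40, 50, 60],
})

print(scores.shape)        # (rows, columns)
print(scores.head(2))      # first two rows instead of the default five
print(scores.tail(2))      # last two rows
scores.info()              # prints its report directly
print(scores.describe())   # summary statistics for the numeric column
```

Note that `describe()` skips the text column `name` by default and summarizes only `score`.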


## Handling Missing Data

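Datasets often contain missing values. Pandas provides several methods to handle them:

- `df.dropna()` – Returns a copy with every row that contains at least one missing value removed.
- `df.fillna(value)` – Returns a copy with missing values replaced by the specified value.
- `df.isnull().sum()` – Counts missing values in each column.

Neither `dropna()` nor `fillna()` changes `df` itself, so assign the result to a variable to keep it.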


In [None]:
# Count missing values in each column
print(df.isnull().sum())

# Fill missing values with 0 in every column (df itself is unchanged)
df_filled = df.fillna(0)

# Drop rows that contain at least one missing value (df itself is unchanged)
df_dropped = df.dropna()
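
Filling everything with 0 is not always appropriate, for example in a text column. The sketch below, on hypothetical data, shows two common variations: filling each column with its own value, and dropping rows only when one particular column is missing. Using the column mean as the fill value is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both a text and a numeric column.
ratings = pd.DataFrame({
    "title": ["Movie A", "Movie B", None, "Movie D"],
    "rating": [8.0, np.nan, 6.0, 7.0],
})

# Count missing values in each column.
missing_per_column = ratings.isnull().sum()
print(missing_per_column)

# Fill each column with a value that suits its type.
filled = ratings.fillna({
    "title": "Unknown",
    "rating": ratings["rating"].mean(),
})

# Drop a row only when "rating" is missing.
rated_only = ratings.dropna(subset=["rating"])

print(filled)
print(rated_only)
```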


## Saving Data

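After making modifications to your dataset, you may want to save it back to a file:

- `df.to_csv('new_data.csv', index=False)` – Saves the DataFrame to a CSV file. `index=False` leaves the row labels out of the file.
- `df.to_excel('new_data.xlsx')` – Saves the DataFrame to an Excel file. This requires an Excel engine such as `openpyxl` to be installed, and it writes the row labels unless you also pass `index=False`.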


In [None]:
# Save the cleaned dataset to a new CSV file
df_filled.to_csv('cleaned_data.csv', index=False)
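
One way to check a saved file is to read it back in. The sketch below does this with a small, hypothetical DataFrame and writes to a temporary folder so it leaves no files behind. The names `cleaned` and `reloaded` are illustrative only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned data.
cleaned = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "rating": [8.0, 7.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_data.csv")

    # index=False keeps the row labels (0, 1, ...) out of the file.
    cleaned.to_csv(path, index=False)

    # Read the file back to confirm the columns and values survived.
    reloaded = pd.read_csv(path)

print(reloaded)
```

Without `index=False`, the saved file would gain an extra unnamed column holding the old row labels, which then shows up as a column when the file is read again.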
