<a href="https://colab.research.google.com/github/MonkeyWrenchGang/PythonBootcamp/blob/main/day_2/2_4_Pandas_101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing CSV Files with Pandas


---


### Topic: Importing CSV Files with Pandas

In this section, we will explore the process of importing CSV files using the Pandas library. Pandas is a powerful data manipulation and analysis library in Python that provides high-performance, easy-to-use data structures such as DataFrames and Series.

### Introduction to Pandas

Pandas is built on top of NumPy and provides a wide range of data manipulation and analysis functions, making it a popular choice for data processing tasks. It allows for efficient handling and manipulation of structured data, including CSV files.

### DataFrames and Series

Pandas introduces two primary data structures: DataFrames and Series.

- **DataFrame**: A DataFrame is a 2-dimensional tabular data structure that can store data of different types in columns. It is similar to a spreadsheet or a SQL table and provides powerful functionalities for indexing, slicing, filtering, and aggregating data.

- **Series**: A Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a DataFrame or a single-dimensional array and is often used to represent a single column or a single row of data.

By utilizing DataFrames and Series, Pandas simplifies the process of data loading, manipulation, and analysis, making it an essential tool for working with structured data.
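To make the two structures concrete, here is a minimal sketch. The table contents and the variable names (`people`, `ages`) are made up for illustration and are not part of the course dataset.

```python
import pandas as pd

# A small, made-up table: each dictionary key becomes a column.
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara"],
    "age": [31, 25, 40],
})

# Selecting one column from a DataFrame returns a Series.
ages = people["age"]

print(type(people).__name__)  # DataFrame
print(type(ages).__name__)    # Series
print(people.shape)           # (3, 2): 3 rows, 2 columns
```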

---

In the following sections, we will first review how libraries are imported in Python, then dive into the process of importing CSV files using Pandas and explore various operations that can be performed on the loaded data.



## Introduction to Importing Libraries in Python

---

In Python, libraries are pre-built collections of functions, modules, and classes that provide specific functionality to simplify programming tasks. These libraries offer a wide range of capabilities, from data manipulation and analysis to web scraping, machine learning, and more.

### Why Import Libraries?

- **Code Reusability**: Libraries allow you to leverage existing code and avoid reinventing the wheel. By importing libraries, you can access pre-built functions and modules to perform common tasks without having to write everything from scratch.

- **Productivity and Efficiency**: Libraries provide ready-to-use solutions for complex tasks, saving you time and effort. They offer optimized and tested implementations, allowing you to focus on solving higher-level problems rather than low-level details.

- **Access to Specialized Functionality**: Libraries are often developed by experts in specific domains, such as data analysis, scientific computing, or web development. Importing these libraries gives you access to specialized functions and algorithms tailored for specific tasks.

### Importing Libraries in Python

To use a library in your Python program, you need to import it. The `import` statement is used to bring in the desired library, module, or specific components from a library into your program's namespace.

There are different ways to import libraries, such as:

- `import library_name`: Imports the entire library, and you can access its functions and modules using the library name as a prefix.

- `import library_name as alias`: Imports the library and assigns an alias to it, allowing you to use a shorter or more convenient name when referring to the library.

- `from library_name import module_name`: Imports a specific module (or a single function or class) from the library, so you can refer to that name directly without the library name prefix.
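The three import styles listed above can be sketched with modules from Python's standard library. The choice of `math` and `statistics` here is only for illustration; the same patterns apply to any library.

```python
# 1. Import the whole library; use the library name as a prefix.
import math
print(math.sqrt(16))                # 4.0

# 2. Import with an alias; use the shorter name as a prefix.
import statistics as stats
print(stats.mean([1.0, 2.0, 3.0]))  # 2.0

# 3. Import one specific name; no prefix is needed.
from math import pi
print(round(pi, 2))                 # 3.14
```

The alias form is the one used for Pandas by convention, which is why `pd` appears throughout this notebook.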



---

## Let's Start by Importing Pandas

```python
import pandas as pd
```




In [None]:
import pandas as pd

# Importing CSV Files


---


In this section, we will explore the process of importing CSV (Comma-Separated Values) files in Python. CSV files are a common format for storing tabular data, and importing them allows us to work with structured data in our programs.

### Using read_csv() Function

In Python, the Pandas library provides the `read_csv()` function, which allows us to read CSV files and load them into a DataFrame. The DataFrame is a powerful data structure that simplifies data manipulation and analysis.

### Reading CSV files as DataFrames

By using the `read_csv()` function, we can read the contents of a CSV file and create a DataFrame object. The function automatically infers the data types and handles various CSV formats, making it convenient for data loading.
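As a small sketch of type inference, the example below reads CSV text held in memory instead of a file. The CSV content and the names `csv_text` and `sample` are hypothetical; `read_csv()` accepts a file path, a URL, or a file-like object such as `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical CSV content, kept in memory so no file or network is needed.
csv_text = """name,age,fare
Ana,31,7.25
Ben,25,71.28
Cara,40,8.05
"""

# Wrap the text in StringIO so read_csv can treat it like an open file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (3, 3)
print(sample.dtypes)  # age is inferred as an integer type, fare as a float type
```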

### Displaying and Accessing Data

Once the CSV file is loaded into a DataFrame, we can explore its structure and content. We can display the entire DataFrame to get an overview of the data, or we can access specific rows, columns, or cells using indexing and column names.
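The access patterns described above might look like the following sketch. The `trips` table is made up for illustration; `head()`, `iloc[]`, and `loc[]` are standard Pandas tools.

```python
import pandas as pd

# A small, made-up table to practice on.
trips = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune"],
    "days": [3, 7, 5],
})

first_rows = trips.head(2)       # the first two rows
city_column = trips["city"]      # one column, selected by name
second_row = trips.iloc[1]       # one row, selected by integer position
one_cell = trips.loc[2, "days"]  # one cell, by row label and column name

print(first_rows)
print(second_row["city"])  # Lima
print(one_cell)            # 5
```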

---




In [None]:
import pandas as pd

# Import the Titanic dataset from a CSV file
titanic_data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Display the first few rows of the dataset
print("First few rows of the Titanic dataset:")
titanic_data.head()



First few rows of the Titanic dataset:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Pandas `describe()`

The `describe()` function in Pandas provides a statistical summary of the numerical columns in a DataFrame. It calculates various descriptive statistics, including count, mean, standard deviation, minimum, quartiles, and maximum.

### Syntax
```python
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [5, 10, 15, 20, 25],
                   'C': [10, 20, 30, 40, 50]})

# Calculate the descriptive statistics
summary = df.describe()

# Display the summary statistics
print(summary)

```
## `describe()` Output

The output of the `describe()` function in Pandas provides several statistical measures for each numerical column in a DataFrame. Here's an interpretation of each statistic:

- `count`: The count represents the number of non-null values in each column. It indicates the completeness of the data. If the count is lower than the total number of rows, it suggests missing or null values.

- `mean`: The mean (average) represents the central tendency of the data. It provides an estimate of the typical value for each column. The mean is calculated by summing all values and dividing by the count.

- `std`: The standard deviation represents the dispersion or spread of the data. It measures the variability or deviation from the mean. A higher standard deviation indicates greater variability in the data.

- `min`: The minimum represents the smallest value observed in each column. It provides the lower bound of the data range. A minimum far below the rest of the data may indicate an outlier or a data-entry error.

- `25%`, `50%`, `75%`: These are the quartiles of the data, which correspond to the 25th, 50th, and 75th percentiles. The `25%` represents the first quartile or the lower quartile, indicating the value below which 25% of the data falls. The `50%` represents the second quartile or the median, indicating the value below which 50% of the data falls. The `75%` represents the third quartile or the upper quartile, indicating the value below which 75% of the data falls.

- `max`: The maximum represents the largest value observed in each column. It provides the upper bound of the data range. A maximum far above the rest of the data may indicate an outlier or a data-entry error.

The `describe()` function in Pandas allows for a quick and concise summary of the statistical measures of the numerical columns in a DataFrame, providing insights into the distribution and characteristics of the data.
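To see how these statistics relate to the underlying values, here is a small sketch with one missing value. The `scores` data is made up; the point is that `count` skips the missing entry and `mean` is the sum divided by that count.

```python
import pandas as pd

# Four rows, one of them missing (None becomes NaN).
scores = pd.DataFrame({"score": [10.0, 20.0, None, 40.0]})
summary = scores.describe()

# count ignores the missing value: 3 of the 4 rows are non-null.
print(summary.loc["count", "score"])  # 3.0

# mean is the sum of the non-null values divided by count: (10 + 20 + 40) / 3.
print(summary.loc["mean", "score"])

# The 50% row is the median of the non-null values.
print(summary.loc["50%", "score"])    # 20.0

# min and max bound the observed range.
print(summary.loc["min", "score"], summary.loc["max", "score"])  # 10.0 40.0
```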



In [None]:
# Perform descriptive statistics on numerical columns
print("\nDescriptive statistics of the Titanic dataset:")
titanic_data.describe()



Descriptive statistics of the Titanic dataset:


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


## Pandas `info()`

The `info()` function in Pandas provides a concise summary of a DataFrame's structure and content. It displays information about the column names, data types, and non-null counts for each column, along with the total number of entries in the DataFrame.

### Syntax
```python
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': ['John', 'Jane', 'Alex', 'Lisa', 'Mike'],
                   'C': [10.5, 7.2, 4.9, 3.1, 8.6]})

# Get information about the DataFrame
df.info()
```



## Output

- `RangeIndex`: Indicates the range of index values, from 0 to the total number of entries minus one.
- `Data columns`: Provides information about each column, including the column name, non-null count, and data type.
- `Dtype`: Specifies the data type of each column.
- `memory usage`: Displays the memory usage of the DataFrame.

The `info()` function in Pandas is a useful tool to quickly understand the structure and characteristics of a DataFrame, including data types, non-null counts, and memory usage.
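Because `info()` prints its report rather than returning it, the sketch below captures the text with the `buf` parameter and then retrieves the same facts programmatically. The `pets` table is made up for illustration.

```python
import io
import pandas as pd

# A made-up table with one missing weight.
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Ivy"],
    "weight": [12.5, None, 4.1],
})

# Pass a text buffer to capture the report instead of printing it directly.
buffer = io.StringIO()
pets.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same facts are also available as regular Pandas objects.
non_null_counts = pets.notna().sum()  # non-null count per column
print(non_null_counts)
print(pets.dtypes)                    # data type of each column
```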




In [None]:
# Get information about the dataset
print("\nSummary of the Titanic dataset:")
titanic_data.info()



Summary of the Titanic dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Accessing a Specific Column (Series) Using Single `[]`

In Pandas, you can access specific columns and series in a DataFrame using single `[]` brackets. Here's an explanation of how it works:

- **Accessing a Specific Column**: To access a specific column, you can use the column name within the brackets. This returns a series containing the values of that column. For example, `df['column_name']` retrieves the values in the column named `'column_name'`.

- **Accessing a Series**: If you want to access a specific series within a DataFrame, you can use the same syntax as accessing a column. The series can be a column or a computed result of an operation on columns. For example, `df['column_name']` returns the series of values in the column named `'column_name'`, and `df['column1'] + df['column2']` returns a series that is the sum of `'column1'` and `'column2'`.

Accessing a single column (a Series) in this way allows you to retrieve and work with one column of data for analysis, computations, or further manipulation.
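Both cases can be sketched on a small made-up table. The column names `siblings` and `parents` are hypothetical stand-ins for any two numeric columns.

```python
import pandas as pd

# A made-up table with two numeric columns.
family = pd.DataFrame({
    "siblings": [1, 0, 2],
    "parents": [0, 2, 1],
})

# Single brackets with a column name return a Series.
siblings = family["siblings"]

# Adding two columns works element by element and also returns a Series.
relatives = family["siblings"] + family["parents"]

print(type(siblings).__name__)  # Series
print(relatives.tolist())       # [1, 2, 3]
print(siblings.mean())          # 1.0
```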



In [None]:
# Access specific columns
print("\nPassenger ages from the Titanic dataset:")
print(titanic_data["Age"].head())




Passenger ages from the Titanic dataset:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64


In [None]:
titanic_data["Age"].mean()

29.69911764705882

## Filtering the Dataset based on Conditions

In Pandas, you can filter a dataset based on specific conditions to retrieve the subset of data that meets certain criteria. Here's an explanation of how to filter a dataset based on conditions:

- **Using Boolean Conditions**: To filter a dataset, you can use boolean conditions inside square brackets (`[]`) following the DataFrame name. The conditions are created using comparison operators (`<`, `>`, `==`, `!=`, etc.) to check if each value in a column satisfies the condition. For example, `df[df['column'] > 10]` filters the DataFrame to only include rows where the values in the column named `'column'` are greater than 10.

- **Combining Multiple Conditions**: You can combine multiple conditions using logical operators such as `&` (AND) and `|` (OR). For example, `df[(df['column1'] > 10) & (df['column2'] == 'value')]` filters the DataFrame to include rows where the values in `'column1'` are greater than 10 AND the values in `'column2'` are equal to `'value'`.

- **Using Methods**: Pandas also provides the `query()` method and the `loc[]` indexer to filter datasets based on conditions. The `query()` method allows you to write condition expressions as strings, while `loc[]` selects data based on labels or boolean arrays.

Filtering like this creates a new "result set" DataFrame. Notice that `adult_passengers` in the query below is the result of filtering for passengers who are at least 18 years old.
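The three approaches above can be compared side by side in a short sketch. The `riders` table and its columns are made up for illustration; note that each condition in a combined filter needs its own parentheses.

```python
import pandas as pd

# A made-up table to filter.
riders = pd.DataFrame({
    "age": [8, 35, 17, 52],
    "paid": [True, True, False, True],
})

# Boolean condition inside square brackets.
adults_mask = riders[riders["age"] >= 18]

# Two conditions combined with & (AND); wrap each one in parentheses.
paying_adults = riders[(riders["age"] >= 18) & (riders["paid"])]

# The same adult filter written as a string for query().
adults_query = riders.query("age >= 18")

# loc[] with a boolean condition for rows and a column name.
adult_ages = riders.loc[riders["age"] >= 18, "age"]

print(adults_mask.equals(adults_query))  # True: both filters keep the same rows
print(adult_ages.tolist())               # [35, 52]
```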


In [None]:
# Filter the dataset based on conditions
adult_passengers = titanic_data[titanic_data["Age"] >= 18]
print("\nAdult passengers from the Titanic dataset:")
adult_passengers.head()


Adult passengers from the Titanic dataset:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
# Filter the dataset for passengers under 18 who survived
child_passengers = titanic_data[(titanic_data["Age"] < 18) & (titanic_data["Survived"] == 1)]
print("\nChild passengers who survived, from the Titanic dataset:")
child_passengers.head()


Child passengers who survived, from the Titanic dataset:


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
22,23,1,3,"McGowan, Miss. Anna ""Annie""",female,15.0,0,0,330923,8.0292,,Q
39,40,1,3,"Nicola-Yarred, Miss. Jamila",female,14.0,1,0,2651,11.2417,,C
43,44,1,2,"Laroche, Miss. Simonne Marie Anne Andree",female,3.0,1,2,SC/Paris 2123,41.5792,,C


## The `.value_counts()` Method

The `.value_counts()` method in Pandas is a convenient way to quickly count the occurrences of unique values in a column of a DataFrame or a Series. It returns a new Series that displays the counts of each unique value in descending order.

### Syntax

```python
series.value_counts()

df['column'].value_counts()
```
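Here is a minimal sketch on a made-up Series of port codes (the name `ports` and its values are hypothetical). By default, `value_counts()` leaves missing values out of the counts.

```python
import pandas as pd

# A made-up Series of categorical codes.
ports = pd.Series(["S", "C", "S", "Q", "S", "C"])

# Count how often each unique value occurs, most frequent first.
counts = ports.value_counts()

print(counts.index.tolist())  # ['S', 'C', 'Q']
print(counts["S"])            # 3
print(counts.sum())           # 6: every non-missing value is counted once
```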

### Exercise
1. Use `.value_counts()` to count the values in the Titanic "Survived" column
2. Use `.value_counts()` to count the values in the Titanic "Pclass" column
3. Use `.value_counts()` to count the values in the Titanic "Age" column

In [None]:

titanic_data["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [None]:
titanic_data["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [None]:
titanic_data["Age"].value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: Age, Length: 88, dtype: int64

## More Value Counts
By setting `normalize=True`, the `value_counts()` method returns the **normalized value counts**, representing the proportion of each unique value over the total count of values in the series or column. Each resulting value ranges between 0 and 1, where 1 indicates 100% occurrence, and the proportions sum to 1.

This is especially useful when analyzing survey responses, categorical variables, or exploring patterns within datasets. It helps us gain insights into the relative prevalence of different categories, allowing us to make informed decisions based on the proportions rather than just counts!

```python

# Create a DataFrame with some categorical data
data = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'A']})

# Calculate the normalized value counts
value_counts_normalized = data['Category'].value_counts(normalize=True)

# Print the normalized value counts
print(value_counts_normalized)

```
### Exercise:

1. Use `.value_counts(normalize=True)` to get the proportions of Survived
2. Use `.value_counts(normalize=True)` to get the proportions of Sex
3. Use `.value_counts(normalize=True)` to get the proportions of Age
