<center>  <h1>Introduction To Pandas</h1></center>

<center><h3>CHAPTER 1</h3></center>
In this chapter, we will cover the following topics:

- Introduction to the world of pandas
- Exploring the history and evolution of pandas
- Components and applications of pandas
- Understanding the basic concepts of pandas
- Activity – comparing sales data for two stores


### History

- Open sourced in 2009 by Wes McKinney, an MIT graduate with experience in quantitative finance.
- Dissatisfied with the existing tools, McKinney aimed to create an intuitive and elegant tool that required minimal code.
- Became one of the most popular tools in the data science community.
- Contributed significantly to the increased popularity of Python in the data science field.

### Versatility of Pandas:

- Well-suited for handling various types of data, including:
  - Tabular data with columns capable of storing different types of data (e.g., numerical and text data).
  - Ordered and unordered series data (e.g., [2, 4, 8, 9, 10]).
  - Multi-dimensional matrix data (three-dimensional, four-dimensional, etc.).
  - Any other form of observational/statistical data (e.g., SQL data and R data).
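
As a minimal sketch of the first two data types in the list above, the snippet below builds a small table with one text column and one numeric column, plus an ordered series. The column names and values are hypothetical sample data, and the DataFrame and Series objects used here are introduced later in this chapter.

```python
import pandas as pd

# Hypothetical sample data: one text column and one numeric column.
mixed = pd.DataFrame(
    {"product": ["pen", "book", "lamp"], "price": [1.5, 12.0, 30.25]}
)

# Each column keeps its own data type.
print(mixed.dtypes)

# An ordered series of numbers, matching the list example above.
ordered = pd.Series([2, 4, 8, 9, 10])
print(ordered.is_monotonic_increasing)
```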

### Key Features:

- Large repertoire of intuitive and easy-to-use functions/methods.
- Widely used as a go-to tool for data analysis in Python.

### Pandas Components and Applications:

- pandas/core: Implements basic data structures like Series and DataFrames, crucial for data manipulation.
- pandas/src: Houses foundational algorithms written in C or Cython, providing core functionalities implicitly used by users.
- pandas/io: Toolsets for file and data input/output, supporting formats like CSV and text (covered in Chapter 3, Data I/O).
- pandas/tools: Code and algorithms for functions/methods like merge, join, and concat.
- pandas/sparse: Provides sparse versions of data structures like DataFrames and Series, which compactly store data that consists mostly of missing or repeated values.
- pandas/stats: Contains tools for statistical functions such as regression and classification.
- pandas/util: Includes utilities for debugging the library.
- pandas/rpy: Interface for connecting to R.

### Notable Applications

- Recommendation systems
- Advertising
- Stock predictions
- Neuroscience
- Natural language processing (NLP)

### The Series object

- Begin with understanding one-dimensional data in pandas.
- One-dimensional data is represented as Series objects in pandas.
- Series objects are initialized using the **pd.Series()** constructor.

**Code Example:**

- The following code demonstrates the use of the pd.Series() constructor.
- It creates a new Series named ser1. Evaluating the Series by its assigned name (e.g., ser1) displays its contents.


In [1]:
# importing pandas library
import pandas as pd

# Creating a Series
ser1 = pd.Series([10, 20, 30, 40])

# Displaying the Series
ser1

0    10
1    20
2    30
3    40
dtype: int64

- From the output, you can see that the one-dimensional list is represented as a Series. The numbers to the left of the values (0, 1, 2, 3) are its indices.
- You can represent different types of data in a Series. For example, consider the following snippet:


In [2]:
ser2 = pd.Series(
    [[10, 20], [30, 40.5, "series"], [50, 55], {"Name": "Tess", "Org": "Packt"}]
)
ser2

0                            [10, 20]
1                  [30, 40.5, series]
2                            [50, 55]
3    {'Name': 'Tess', 'Org': 'Packt'}
dtype: object
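
The two outputs above report different dtypes (int64 and object). The following is a minimal sketch of how the dtype depends on the values passed to the constructor; the variable names and sample values are illustrative only.

```python
import pandas as pd

# Values of one numeric type: pandas picks a specific numeric dtype.
ints = pd.Series([10, 20, 30, 40])

# A single float promotes the whole Series to a float dtype.
floats = pd.Series([10, 20.5, 30])

# Unrelated Python types fall back to the generic object dtype.
mixed = pd.Series([[10, 20], {"Name": "Tess"}, 3.5])

print(ints.dtype, floats.dtype, mixed.dtype)

# Elements are retrieved by index label, here the default 0, 1, 2.
print(mixed[1]["Name"])
```

An object Series can hold anything, but numeric operations on it are generally slower than on a Series with a numeric dtype, so same-typed data is usually preferable.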

### The DataFrame object

- DataFrame is a fundamental structure in pandas.
- Represents two-dimensional data in rows and columns.
- Use the DataFrame() constructor to initialize a DataFrame in pandas.
- The following code demonstrates the conversion of a simple list object into a single-column DataFrame.


In [3]:
# Create a DataFrame using the constructor
df = pd.DataFrame([30, 50, 20])

# Display the DataFrame
df

|    | 0  |
|----|----|
| 0  | 30 |
| 1  | 50 |
| 2  | 20 |


- The preceding code converts a list of three elements into a DataFrame using the DataFrame() constructor.

**DataFrame Shape Visualization:**

- Use the **df.shape** attribute (without parentheses) to inspect the shape of a DataFrame.
- The shape provides information about the number of rows and columns in the DataFrame.


In [4]:
df.shape

(3, 1)

- Here, the first element (3) is the number of rows, while the second (1) is the number of columns.
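
Because shape is a plain tuple, it can be unpacked directly into row and column counts. The sketch below rebuilds the same DataFrame so that it runs on its own; the names n_rows and n_cols are illustrative.

```python
import pandas as pd

df = pd.DataFrame([30, 50, 20])

# shape is an attribute holding a tuple, so it is read without parentheses.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} column(s)")

# The same numbers are available through len() and the column index.
assert n_rows == len(df)
assert n_cols == len(df.columns)

# Calling it like a method fails, because a tuple is not callable.
try:
    df.shape()
except TypeError as err:
    print("Calling df.shape() fails:", err)
```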


- If you look at the DataFrame, you will see 0 at the top of the column.
- This is the default name that will be assigned to the column when the DataFrame is created.
- You can also see the numbers 0, 1, and 2 along the rows. These are called indices.
- To display the column names of the DataFrame, you can use the following command:


In [5]:
df.columns

RangeIndex(start=0, stop=1, step=1)

- The output shows a range of indices that starts at 0 and stops before 1, with a step size of 1. In effect, there is just one column, and its name is 0.
- You can also display the names of the indices for the rows using the following command:


In [6]:
df.index

RangeIndex(start=0, stop=3, step=1)
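
A RangeIndex follows the same convention as Python's built-in range(): the stop value is exclusive. The short sketch below checks this on the row index of the same DataFrame, rebuilt here so the snippet is self-contained.

```python
import pandas as pd

df = pd.DataFrame([30, 50, 20])
rows = df.index  # RangeIndex(start=0, stop=3, step=1)

# stop is exclusive, like Python's built-in range().
print(len(rows))   # number of row labels
print(rows[-1])    # last row label
print(3 in rows)   # the stop value itself is not a label
```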

- There are many instances where you would want to use the column names and row indices for further processing.
- For such purposes, you can convert them into a list using the built-in list() function.
- The following snippet converts the column names and row indices into lists and then prints those values:


In [7]:
# print names of columns
print("These are the names of the columns", list(df.columns))

# print indices
print("These are the row indices", list(df.index))

These are the names of the columns [0]
These are the row indices [0, 1, 2]


- From the output, you can see that the column names and the row indices are represented as a list.
- You can also rename the columns and row indices by assigning them to any list of values.
- To rename the columns, assign a list of new names to **df.columns**, as shown in the following snippet:


In [8]:
# Renaming the columns
df.columns = ["V1"]
df

|    | V1 |
|----|----|
| 0  | 30 |
| 1  | 50 |
| 2  | 20 |
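
One detail worth knowing when assigning to df.columns: the list must contain exactly one name per existing column. The sketch below, which assumes the same single-column DataFrame, shows that a list of the wrong length is rejected and the existing names are kept.

```python
import pandas as pd

df = pd.DataFrame([30, 50, 20])

# One column, so the list of new names needs exactly one entry.
df.columns = ["V1"]

# A list of the wrong length is rejected rather than silently truncated.
try:
    df.columns = ["V1", "V2"]
except ValueError as err:
    print("Rename failed:", err)

# The earlier, valid assignment is still in place.
print(list(df.columns))
```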


- To rename the row indices, assign a list of new labels to **df.index**, as shown in the following snippet:


In [9]:
# Renaming the indices
df.index = ["R1", "R2", "R3"]
df

|    | V1 |
|----|----|
| R1 | 30 |
| R2 | 50 |
| R3 | 20 |
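
Assigning to df.columns or df.index replaces all labels at once and modifies the DataFrame in place. As a possible alternative, the rename() method changes only selected labels and returns a new DataFrame. The new labels "sales" and "store_1" below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame([30, 50, 20])
df.columns = ["V1"]
df.index = ["R1", "R2", "R3"]

# rename() maps old labels to new ones and returns a new DataFrame;
# labels that are not in the mapping are left unchanged.
renamed = df.rename(columns={"V1": "sales"}, index={"R1": "store_1"})

print(list(renamed.columns), list(renamed.index))

# The original DataFrame keeps its labels.
print(list(df.columns), list(df.index))
```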


- What if you need to create a DataFrame that contains multiple columns from list data? This can easily be
  achieved using a nested list of lists, as follows:


In [10]:
# Creating DataFrame with multiple columns
df1 = pd.DataFrame([[10, 15, 20], [100, 200, 300]])

print("Shape of new data frame", df1.shape)
df1

Shape of new data frame (2, 3)


|    | 0   | 1   | 2   |
|----|-----|-----|-----|
| 0  | 10  | 15  | 20  |
| 1  | 100 | 200 | 300 |


- You can see that the new DataFrame has two rows and three columns.
- The first list forms the first row, and each of its elements gets mapped to one of the three columns.
- The second list becomes the second row.
- You can also assign the column names and row names while creating the DataFrame. To do that for the preceding DataFrame, run the following command:


In [11]:
df1 = pd.DataFrame(
    [[10, 15, 20], [100, 200, 300]], columns=["V1", "V2", "V3"], index=["R1", "R2"]
)
df1

|    | V1  | V2  | V3  |
|----|-----|-----|-----|
| R1 | 10  | 15  | 20  |
| R2 | 100 | 200 | 300 |


- From the output, you can see that the column names (V1, V2, and V3) and index names (R1 and R2) have been initialized with the user-provided values.
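
A nested list describes the data row by row. The same table can also be described column by column with a dictionary, where each key becomes a column name. The sketch below builds both versions and checks that they are equal; the names by_rows and by_cols are illustrative.

```python
import pandas as pd

# Row-wise construction, as in the preceding example.
by_rows = pd.DataFrame(
    [[10, 15, 20], [100, 200, 300]], columns=["V1", "V2", "V3"], index=["R1", "R2"]
)

# Column-wise construction: each dictionary key becomes a column name.
by_cols = pd.DataFrame(
    {"V1": [10, 100], "V2": [15, 200], "V3": [20, 300]}, index=["R1", "R2"]
)

# Both approaches describe the same two-row, three-column table.
print(by_rows.equals(by_cols))
```

Which form to use mostly depends on how the data arrives: records tend to fit the row-wise form, while separate measurements per variable fit the column-wise form.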

### Working with local files

- Working with local files involves importing data from various source files and exporting outputs in different formats.
- These are essential, indispensable steps in data manipulation and analysis.
- The following exercise will focus on performing preliminary operations with a CSV file.

- **Dataset: Student Performance Data**, sourced from the UCI Machine Learning Repository.
- This dataset details student achievement in secondary education in two Portuguese schools. Some of the key variables of the dataset include student grades, demographic information, and other social and school-related features, such as hours of study time and prior failures.
- Source: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th Future Business Technology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April 2008, EUROSIS, ISBN 978-9077381-39-7.
- Link -> https://archive.ics.uci.edu/ml/datasets/Student+Performance.


### Reading a CSV file

- To read a CSV file, you can use the command **pd.read_csv(filename, delimiter=delimiter)**, where filename is the path to the file and delimiter is the character that separates the values.
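
The following is a minimal sketch of pd.read_csv() with a non-default delimiter. To keep it runnable without a file on disk, it reads from an in-memory text buffer; the three rows and the semicolon separator are hypothetical sample data, not the actual contents of the dataset.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample standing in for a file on disk.
raw = io.StringIO("school;age;G3\nGP;18;6\nGP;17;10\nMS;16;15\n")

# The first argument can be a file path or a file-like object.
# delimiter is passed by keyword; sep is an equivalent name for it.
students = pd.read_csv(raw, delimiter=";")

print(students.shape)
print(students.head())
```

With a real file, the buffer would be replaced by the path to the CSV file as a string.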
