### Pandas - Theory

---

##### What is Pandas?

Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use data structures and data analysis tools for handling structured data. Pandas is widely used for tasks such as data cleaning, preparation, and analysis. Its primary data structures are Series (one-dimensional labeled array) and DataFrame (two-dimensional labeled data structure with columns of potentially different types). These structures make it easy to work with tabular data, time series, and more. Pandas is a powerful tool for data wrangling and manipulation, and it's often used in conjunction with other libraries like NumPy and Matplotlib for data analysis and visualization.

The project describes itself this way: "Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."

Source: https://pandas.pydata.org/about/index.html

---

##### What are the data structures of pandas?

Pandas has two core data structures, Series and DataFrame:
- A Series is a one-dimensional labeled array capable of holding data of any type.
- A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

These two structures are fundamental for working with data in Pandas.

---

##### What is a Pandas Series?

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). It is like a single column in a table. Each element has a label, and the labels together form the index, which makes it easy to access and manipulate the data. You can create a Series from a list, array, dictionary, or scalar value. Series are the building blocks for more complex data manipulation and analysis in Pandas.

---

##### What are the different ways to create a series?

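```python
import pandas as pd
import numpy as np

# From a list: the index defaults to 0, 1, 2, ...
my_list = [10, 20, 30, 40]
series_from_list = pd.Series(my_list)
print(series_from_list)

# From a NumPy array: the Series keeps the array's dtype
my_array = np.array([10, 20, 30, 40])
series_from_array = pd.Series(my_array)
print(series_from_array)

# From a dictionary: the keys become the index labels
my_dict = {'a': 100, 'b': 200, 'c': 300}
series_from_dict = pd.Series(my_dict)
print(series_from_dict)

# From a scalar value: the value is repeated for every index label
scalar_value = 5
series_from_scalar = pd.Series(scalar_value, index=['a', 'b', 'c', 'd'])
print(series_from_scalar)
```

Output:

```
0    10
1    20
2    30
3    40
dtype: int64
0    10
1    20
2    30
3    40
dtype: int32
a    100
b    200
c    300
dtype: int64
a    5
b    5
c    5
d    5
dtype: int64
```

The second Series reports `int32` because a Series built from a NumPy array inherits the array's dtype, and NumPy's default integer type depends on the platform and NumPy version. On many systems this line reads `int64`.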


---

##### What are the attributes of series?

Pandas Series have various attributes that provide useful information about the data. Some common attributes include:

<b>1. values:</b> Returns the data as a NumPy array.<br>
<b>2. index:</b> Returns the index labels of the Series.<br>
<b>3. dtype:</b> Returns the data type of the Series.<br>
<b>4. name:</b> Returns the name of the Series.<br>
<b>5. size:</b> Returns the number of elements in the Series.<br>
<b>6. shape:</b> Returns a tuple representing the dimensionality of the Series data.<br>
<b>7. ndim:</b> Returns the number of dimensions of the data.<br>
<b>8. is_unique:</b> Returns True if all values in the Series are unique, otherwise False.<br>

These attributes allow you to access and understand different aspects of the Series data.
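As a minimal sketch, the attributes listed above can be inspected on a small Series. The data and the name `scores` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical example data; the duplicate 20 is deliberate.
s = pd.Series([10, 20, 20, 40], index=["a", "b", "c", "d"], name="scores")

print(s.values)     # the underlying data as a NumPy array
print(s.index)      # the index labels
print(s.dtype)      # the data type of the elements
print(s.name)       # scores
print(s.size)       # 4
print(s.shape)      # (4,)
print(s.ndim)       # 1
print(s.is_unique)  # False, because 20 appears twice
```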

---

##### Does negative indexing work on series?

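Not in the way Python lists do. Plain bracket indexing on a Series (for example my_series[-1]) is interpreted as a label lookup, so on a Series with the default integer index it raises a KeyError because no label -1 exists. To access elements by position from the end, use iloc, which accepts negative positions: my_series.iloc[-1] returns the last element. loc does not help here because it is label-based. On a Series with non-integer labels, older Pandas versions fall back to positional access for my_series[-1], but that fallback is deprecated, so iloc is the reliable choice.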
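The following sketch, using a hypothetical four-element Series with the default integer index, shows the difference between bracket indexing and `iloc` for negative positions.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])  # default index 0, 1, 2, 3

# iloc is position-based, so negative positions count from the end.
last = s.iloc[-1]
last_two = s.iloc[-2:]
print(last)
print(last_two)

# Plain brackets look up the label -1, which does not exist here.
bracket_failed = False
try:
    s[-1]
except KeyError:
    bracket_failed = True
    print("s[-1] raised KeyError: there is no label -1 in the index")
```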

---

##### What is a dataframe?

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is a primary data structure in the Pandas library and is designed for handling structured data, similar to a spreadsheet or SQL table. DataFrames allow for easy manipulation, analysis, and cleaning of data, making them extremely useful for data processing tasks in Python.

Selecting a single column (or a single row) of a DataFrame returns a Series.
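A short sketch of that relationship, with hypothetical column names and values:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25]})

col = df["age"]   # one column: a Series indexed by the row labels
row = df.iloc[0]  # one row: a Series indexed by the column labels

print(type(col).__name__)  # Series
print(type(row).__name__)  # Series
print(row.index.tolist())  # ['name', 'age']
```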

---

##### What are the attributes of a dataframe?

In Pandas, a DataFrame has several attributes that provide important information about the data it contains. Some common attributes of a DataFrame include:

<b>1. shape:</b> Returns a tuple representing the dimensionality of the DataFrame (rows, columns).<br>
<b>2. columns:</b> Returns the column labels of the DataFrame.<br>
<b>3. index:</b> Returns the row labels of the DataFrame.<br>
<b>4. dtypes:</b> Returns the data types of each column in the DataFrame.<br>
<b>5. values:</b> Returns the actual data in the DataFrame as a 2D ndarray.<br>
<b>6. T:</b> Returns the transpose of the DataFrame.<br>

These attributes are useful for accessing and understanding the structure of the data within a DataFrame.
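As an illustration, here are the same attributes on a small DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "age": [30, 25, 41]})

print(df.shape)    # (3, 2): 3 rows, 2 columns
print(df.columns)  # the column labels
print(df.index)    # the row labels (0, 1, 2 by default)
print(df.dtypes)   # one dtype per column
print(df.values)   # the data as a 2D NumPy array
print(df.T.shape)  # (2, 3): rows and columns swapped
```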

---

##### What are the different ways of creating a dataframe?

There are several ways to create a DataFrame in Python using the Pandas library:

<b>1. From a dictionary:</b> You can create a DataFrame from a dictionary where keys are column names and values are lists or arrays representing the data.<br>
<b>2. From a list of dictionaries:</b> You can create a DataFrame from a list of dictionaries where each dictionary represents a row of data.<br>
<b>3. From a 2D array or list:</b> You can create a DataFrame from a 2D array or a list of lists.<br>
<b>4. From a CSV file:</b> You can read data from a CSV file directly into a DataFrame using Pandas' read_csv function.<br>
<b>5. From a SQL database:</b> You can create a DataFrame by querying a SQL database using Pandas' read_sql function.<br>

These are some common ways to create a DataFrame in Pandas. Each method provides flexibility in how data can be imported and structured within a DataFrame.
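The five approaches can be sketched as follows. To keep the example self-contained, the CSV is read from an in-memory string instead of a file, and the SQL query runs against a temporary in-memory SQLite database. The table name `people` and all data are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# 1. From a dictionary: keys are column names, values are the column data.
from_dict = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25]})

# 2. From a list of dictionaries: each dictionary is one row.
from_records = pd.DataFrame([{"name": "Ann", "age": 30},
                             {"name": "Bob", "age": 25}])

# 3. From a list of lists: column names are passed separately.
from_lists = pd.DataFrame([["Ann", 30], ["Bob", 25]], columns=["name", "age"])

# 4. From CSV: read_csv accepts a file path or a file-like object.
csv_text = "name,age\nAnn,30\nBob,25\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# 5. From SQL: read_sql runs a query on an open database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [("Ann", 30), ("Bob", 25)])
from_sql = pd.read_sql("SELECT name, age FROM people", conn)
conn.close()

for frame in (from_dict, from_records, from_lists, from_csv, from_sql):
    print(frame)
```

All five calls produce a DataFrame with the same two rows and the same `name` and `age` columns.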

---

##### What are iloc and loc?

iloc and loc are two indexers in Pandas for selecting rows and columns from a DataFrame or Series:

- <b>iloc:</b> Integer-location based indexing. It selects rows and columns by their integer position (starting at 0), regardless of what the index labels are. A slice start:end includes start but excludes end, as with Python lists.
- <b>loc:</b> Label-based indexing. It selects rows and columns by their labels, whether the index is the default integer index or a custom one. A slice start:end includes both start and end.

Both are fundamental for data selection and manipulation in Pandas. They give access to specific data points based on their position or label.
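The difference in slice endpoints is easiest to see side by side. This sketch uses a hypothetical DataFrame, first with a custom string index and then with the default integer index.

```python
import pandas as pd

df = pd.DataFrame({"score": [90, 80, 70, 60]}, index=["a", "b", "c", "d"])

# iloc: positions 0 and 1; the end position 2 is excluded.
by_position = df.iloc[0:2]
# loc: labels "a" through "c"; the end label is included.
by_label = df.loc["a":"c"]

print(by_position.index.tolist())  # ['a', 'b']
print(by_label.index.tolist())     # ['a', 'b', 'c']

# Single cells: (row, column) by label or by position.
print(df.loc["b", "score"])  # 80
print(df.iloc[1, 0])         # 80

# With the default integer index, the same slice gives different results.
default = df.reset_index(drop=True)
print(len(default.iloc[0:2]))  # 2 rows: positions 0 and 1
print(len(default.loc[0:2]))   # 3 rows: labels 0, 1 and 2
```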
 

---

##### What is the difference between count and size?

For a Pandas DataFrame:
- count() is a method that returns the number of non-null values in each column. NaN (null) values are not counted.
- size is an attribute that returns the total number of elements in the DataFrame (rows times columns), including null (NaN) values.

The same distinction applies to a Series: count() skips nulls, while size is the full length of the Series.
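A minimal sketch with hypothetical data containing one missing value per column shows the two side by side.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

per_column = df.count()  # non-null values per column
total = df.size          # all cells, nulls included: 3 rows * 2 columns

print(per_column)        # a: 2, b: 2
print(total)             # 6

# On a single column (a Series):
print(df["a"].count())   # 2
print(df["a"].size)      # 3
```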