# A Pandas Introduction 🐼
### What is Pandas?
Pandas is a Python library for data manipulation and analysis. It provides efficient, easy-to-use data structures for working with structured data from sources such as CSV files, Excel spreadsheets, SQL databases, and JSON files.
### Install the latest version of pandas
conda: `conda install pandas`

PyPI: `pip install pandas`


Official pandas documentation: <https://pandas.pydata.org/docs/reference/index.html>


---

## Import 
Before we can use the package, we need to import it. Here we abbreviate the package name so we don't have to type the whole name every time.
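
By convention, pandas is imported under the alias `pd`. A minimal sketch:

```python
import pandas as pd  # `pd` is the alias used by convention

# Printing the installed version is a quick way to confirm the import worked
print(pd.__version__)
```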

## Load the data 
Pandas has several functions for reading data. The data is stored in a *DataFrame*, the main data structure of pandas. In this case we want to load a *.csv* file and therefore use the function `read_csv()`.
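
A minimal sketch of loading CSV data. To keep it self-contained, the CSV text is read from an in-memory buffer rather than from a file. The column names and values are invented for illustration and only loosely mirror the player data used in this notebook. With a real file, we would pass its path to `read_csv()` instead.

```python
import io

import pandas as pd

# Invented sample data standing in for a real .csv file
csv_text = """name,Hits,Years,Division,salary
Ann,81,1,W,
Bob,130,10,E,475.0
Cid,141,3,W,480.0
"""

# read_csv() accepts a file path or any file-like object
hitter_df = pd.read_csv(io.StringIO(csv_text))

print(hitter_df.shape)  # (number of rows, number of columns)
```

The empty `salary` field in the first data row is read as `NaN`.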

It is also possible to save the *DataFrame* as a csv file again with the method `to_csv()`. By default, the index is written as an additional first column; the `index=False` option prevents this.
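
The effect of `index=False` can be sketched as follows. To avoid writing a file, the example calls `to_csv()` without a path, in which case it returns the CSV text as a string. The data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "Hits": [81, 130]})

# Without a path argument, to_csv() returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)     # the first column holds the index values 0 and 1
print(without_index)  # only the columns name and Hits
```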

## Display DataFrame
To inspect and validate the loaded data, we can use several methods. Besides displaying the entire DataFrame by calling the variable, `head()` displays the first 5 rows and `tail()` displays the last 5 rows. To display a different number of rows, we pass that number to the method call, e.g. `head(10)`.
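
A short sketch of both methods, using an invented DataFrame with 8 rows so that the first and last rows differ:

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({"Hits": [81, 130, 141, 87, 169, 37, 73, 92]})

first_five = df.head()    # first 5 rows (the default)
last_five = df.tail()     # last 5 rows (the default)
first_three = df.head(3)  # first 3 rows

print(first_three)
```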

*Note:* In Jupyter Notebooks, the value of the last expression in a cell is displayed automatically. Any other desired output has to be wrapped in a `print()` call.

For useful information about the DataFrame, such as the column names, the number of non-null values, or the data type of each column, we can use the `dtypes` attribute for data types only or the `info()` method for more detailed output.
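
Both can be tried on a small invented DataFrame:

```python
import pandas as pd

# Invented data; None in a numeric column becomes NaN
df = pd.DataFrame(
    {"name": ["Ann", "Bob"], "Hits": [81, 130], "salary": [None, 475.0]}
)

print(df.dtypes)  # one data type per column
df.info()         # column names, non-null counts, dtypes and memory usage
```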

## Column indexing and statistical calculations
If we want to retrieve a single column of a DataFrame, we can use the index operator `[]` with the desired column name. The selected column is a pandas *Series*, and we can interact with it in much the same way as with a DataFrame.
Since "Hits" is a numeric column, we can calculate statistical metrics using built-in methods such as `mean()` or `max()`.
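
A minimal sketch with invented values. Which metrics are of interest depends on the analysis, so the selection below is only an example:

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid", "Dee"], "Hits": [80, 130, 140, 90]})

hits = df["Hits"]  # a single column is a pandas Series

print(hits.mean())      # arithmetic mean
print(hits.median())    # median
print(hits.min(), hits.max())
print(hits.describe())  # several summary statistics at once
```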

We can also check which distinct values occur in a specific column; duplicates are dropped.
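
One way to do this is the `unique()` method of a Series. That the original notebook used this method is an assumption; the data is invented.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({"Division": ["W", "E", "W", "E", "W"]})

# unique() returns the distinct values in order of first appearance
divisions = df["Division"].unique()
print(divisions)
```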

## Not a Number (NaN)
Data often contains missing or undefined data points, which pandas represents as `NaN`. These data points can interfere with calculations and analysis, so we need to address them. Here we look at two methods: *dropping* and *filling*. While the former removes every row that contains a `NaN` value, the latter replaces each `NaN` value according to a chosen strategy, such as the mean or median, or forward or backward filling.
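
Both approaches can be sketched on a small invented DataFrame. The choice of the mean and of backward filling here is only an example of possible strategies:

```python
import pandas as pd

# Invented data with one missing salary
df = pd.DataFrame(
    {"Hits": [80, 130, 140], "salary": [float("nan"), 400.0, 500.0]}
)

# Dropping: remove every row that contains at least one NaN
dropped = df.dropna()

# Filling with the column mean (NaN is skipped when the mean is computed)
filled_mean = df["salary"].fillna(df["salary"].mean())

# Backward filling: use the next valid value after the gap
filled_bfill = df["salary"].bfill()

print(dropped)
print(filled_mean.tolist())
print(filled_bfill.tolist())
```

None of these calls modifies `df` itself; each returns a new object.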

To count the `NaN` values in a Series, we can use the method `isna()`, which marks each missing value as `True`, and then add up the `True` values with `sum()`.
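
A minimal sketch with an invented Series:

```python
import pandas as pd

# Invented data with two missing values
salary = pd.Series([float("nan"), 475.0, float("nan"), 480.0])

# isna() yields a boolean Series; in sum(), True counts as 1 and False as 0
nan_count = salary.isna().sum()
print(nan_count)
```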

## Indexing multiple columns
It is also possible to retrieve multiple columns of a DataFrame. 

*Note:* To index multiple columns, we pass a list of column names to the index operator, which results in double square brackets `[[ ]]`.
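
For example, with invented data:

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame(
    {"name": ["Ann", "Bob"], "Hits": [81, 130], "Years": [1, 10]}
)

# Outer brackets: index operator; inner brackets: list of column names
subset = df[["name", "Hits"]]
print(subset)
```

Unlike single-column indexing, the result is again a DataFrame, not a Series.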

## Filter / Conditions
Pandas lets us filter a DataFrame by applying conditions. Here, we want to select only the players who have made over 100 hits.
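
A sketch of such a filter on invented data. The comparison produces a boolean Series, which is then used to select the matching rows:

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "Hits": [81, 130, 100]})

mask = df["Hits"] > 100  # boolean Series: True where the condition holds
over_100 = df[mask]      # keep only the rows where mask is True

print(over_100)
```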

Next, we want all players who have made between 100 and 500 hits (100 and 500 themselves excluded).
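
Two conditions can be combined with the `&` operator. Each condition needs its own parentheses, because `&` binds more tightly than the comparison operators. The data below is invented:

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid", "Dee", "Eve"], "Hits": [100, 130, 500, 620, 250]}
)

# Strict comparisons exclude the boundary values 100 and 500
between = df[(df["Hits"] > 100) & (df["Hits"] < 500)]

print(between)
```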

We can also combine a filter with column selection: we select all players who have been playing for 10 years and output only the columns `name` and `salary`.
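
One possible way to write this uses `loc[]`, which takes a row condition and a list of columns in a single step. Whether the original notebook used `loc[]` or chained two index operations is an assumption; the data is invented.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cid", "Dee"],
        "Years": [1, 10, 3, 10],
        "salary": [70.0, 475.0, 480.0, 100.0],
    }
)

# Rows where Years equals 10, restricted to the columns name and salary
ten_years = df.loc[df["Years"] == 10, ["name", "salary"]]

print(ten_years)
```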

## Row Indexing
While columns are indexed by name, rows can be selected by their integer position. The `iloc[]` indexer (used with square brackets, not called like a function) returns the row at a given position.
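
A minimal sketch with invented data:

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "Hits": [81, 130, 141]})

# Positions start at 0, so position 1 is the second row
row = df.iloc[1]

print(row)  # a single row is returned as a Series
```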

We can also specify multiple positions by passing a list, which again results in double square brackets.
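
For example, with invented data. The slice at the end is an additional variant that is not mentioned in the text above:

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "Hits": [81, 130, 141]})

rows = df.iloc[[0, 2]]    # first and third row, selected with a list of positions
first_two = df.iloc[0:2]  # a slice selects a contiguous range (end position excluded)

print(rows)
```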

## Plotting
To plot data of a DataFrame, Pandas provides several plotting functions.

The chart is inconsistent and contains gaps because there are `NaN` values in the data. To solve this problem, we can use one of the methods described above. Here we drop every row with `NaN` values and then reset the index of the DataFrame, since dropping rows leaves gaps in the index.
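
The cleanup step can be sketched as follows. The data is invented, and the plotting call itself is left out so that the example only depends on pandas:

```python
import pandas as pd

# Invented data with two missing salaries
df = pd.DataFrame(
    {
        "Hits": [81, 130, 141, 87],
        "salary": [float("nan"), 475.0, float("nan"), 91.5],
    }
)

# After dropna() the remaining rows keep their old labels 1 and 3;
# reset_index(drop=True) renumbers them from 0 and discards the old labels
clean_df = df.dropna().reset_index(drop=True)

print(clean_df)
```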

We can also specify columns as the x-axis and the y-axis.

## Adding new columns
To add a new column to a pandas DataFrame, we use the index operator `[]` to assign a value or an expression to a new column label. In this example, we create a new column called `avg_hits` by dividing the `Hits` column by the `Years` column.
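
A minimal sketch with invented values:

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({"Hits": [80, 130, 150], "Years": [1, 10, 3]})

# The division is applied element-wise, row by row
df["avg_hits"] = df["Hits"] / df["Years"]

print(df)
```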

## Count unique values
To count the number of unique values in a column of a Pandas DataFrame, we can use the `value_counts()` method. This method returns a new Series object that contains the counts of each unique value in the column, in descending order.
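
For example, with an invented `Division` column:

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({"Division": ["W", "E", "W", "W", "E"]})

# The index of the result holds the unique values, the values hold their counts
counts = df["Division"].value_counts()

print(counts)
```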

## Group values
To group a pandas DataFrame by a column, we can use the `groupby()` method. In this example, we group the `hitter_df` DataFrame by the `Division` column by passing `'Division'` as an argument to `groupby()`. The result is a grouped object; to get values out of it, we apply an aggregation such as `mean()` to it.
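
A sketch of grouping followed by an aggregation. The contents of `hitter_df` are invented here, and averaging the `Hits` column is only one example of an aggregation:

```python
import pandas as pd

# Invented stand-in for the hitter_df DataFrame
hitter_df = pd.DataFrame(
    {"Division": ["W", "E", "W", "E"], "Hits": [80, 130, 140, 70]}
)

grouped = hitter_df.groupby("Division")  # grouped object, nothing computed yet
mean_hits = grouped["Hits"].mean()       # mean of Hits within each division

print(mean_hits)
```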

## Sort values
Sorting values can help identify patterns, enable more efficient querying and filtering, and improve data visualization. To sort values in pandas, we can use the `sort_values()` method on a DataFrame object.

*Note:* By default, `NaN` values are placed at the very end of each sort, regardless of the sort direction.
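
Sorting in both directions can be sketched on invented data with one missing value:

```python
import pandas as pd

# Invented data with one missing salary
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid"], "salary": [float("nan"), 475.0, 91.5]}
)

ascending = df.sort_values("salary")                    # smallest value first
descending = df.sort_values("salary", ascending=False)  # largest value first

print(ascending)
print(descending)
```

In both results, the row with the missing salary comes last.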

## Creating DataFrames manually
Sometimes it becomes necessary to create a small DataFrame manually. This is done by passing the data as a dictionary of lists, where each key becomes a column name and each list holds the values of that column.
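
A minimal sketch with invented names and values:

```python
import pandas as pd

# Keys become column names; all lists must have the same length
data = {
    "name": ["Ann", "Bob", "Cid"],
    "Hits": [81, 130, 141],
}

manual_df = pd.DataFrame(data)

print(manual_df)
```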