# Scientific Python: Pandas

## Benefits

The ***pandas*** library is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.  It is very popular in data science and in scientific Python in general.  It is especially useful for 2D, table-like data structures; data with more than two dimensions is harder to handle with this library.

A few key benefits of this library are:

- **An extensive set of features**: the library provides a large set of functions and commands for analyzing your data
- **Data representation**: its streamlined data structures make the data easier to inspect and understand
- **Made for Python**: with Python growing in popularity in data science and scientific computing, powerful libraries built on top of the language are a clear benefit
- **Less writing and more work done**: a few lines of pandas code can do what would take many lines of plain Python
- **Makes data customizable and flexible**: the feature set lets you customize, edit, and pivot your data based on your needs
- **Efficiently handles large amounts of data**: the library was designed with large datasets in mind, which allows large amounts of data to be imported quickly

## Series and Dataframes

### Series
In pandas, a ***series*** is a one-dimensional labeled array capable of holding any data type.

In [2]:
import pandas

animals = ["cat", "dog", "rat", "horse", "pig"]

animal_series = pandas.Series(animals, name = "Animals")

print(animal_series)
print(animal_series.name)

0      cat
1      dog
2      rat
3    horse
4      pig
Name: Animals, dtype: object
Animals
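The labels of a series do not have to be the default integers. As a small sketch (the `weights` values and labels below are made up for illustration), passing `index=` assigns custom labels that can then be used for lookup:

```python
import pandas

# Hypothetical weights in kilograms, labeled by animal name instead of 0..n-1
weights = pandas.Series(
    [4.5, 30.0, 0.3],
    index=["cat", "dog", "rat"],
    name="Weight (kg)",
)

print(weights["dog"])           # look up a value by its label
print(weights.index.tolist())   # the labels themselves
print(weights.dtype)            # one dtype is shared by all values
```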


### Dataframe
In pandas, a ***dataframe*** is a two-dimensional labeled data structure with columns of potentially different types.  It is very similar to a spreadsheet or a SQL table.  It can also be viewed as a dict of Series objects.

In [16]:
import pandas

car_make = pandas.Series(["Ford", "Toyota", "GMC", "Nissan"], name = "Make")
car_model = pandas.Series(["Focus", "Camry", "Yukon", "Altima"], name = "Model")
car_year = pandas.Series([2008, 2014, 2010, 2022], name = "Year")

car_dataframe = pandas.DataFrame(zip(car_make, car_model, car_year), columns=[car_make.name, car_model.name, car_year.name])

car_dataframe

     Make   Model  Year
0    Ford   Focus  2008
1  Toyota   Camry  2014
2     GMC   Yukon  2010
3  Nissan  Altima  2022
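The "dict of Series" view can be made concrete. The following sketch builds the same table by passing a dict that maps column names to Series objects, and then checks that selecting a single column gives back a Series. The variable name `cars` is illustrative only.

```python
import pandas

car_make = pandas.Series(["Ford", "Toyota", "GMC", "Nissan"], name="Make")
car_model = pandas.Series(["Focus", "Camry", "Yukon", "Altima"], name="Model")
car_year = pandas.Series([2008, 2014, 2010, 2022], name="Year")

# Each dict key becomes a column name; each Series becomes that column
cars = pandas.DataFrame({s.name: s for s in (car_make, car_model, car_year)})

print(cars.dtypes)          # every column keeps its own data type
print(type(cars["Year"]))   # a single column is a Series
```

Compared with `zip`, this form avoids repeating the column names, since each Series already carries its own `name`.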


## Basic Statistics

Pandas has basic statistical functions built in, such as mean, median, mode, standard deviation, and variance.  Just as a recap:
- ***Mean***: the average of the dataset
- ***Median***: the middle value of the dataset once it is sorted
- ***Mode***: the most common value in a dataset
- ***Standard Deviation***: measures the spread of the data relative to the mean; it is the square root of the variance
- ***Variance***: the average squared difference between each point and the mean

To compute these basic statistics, you call them on a pandas data structure, whether that is a series or a dataframe.  The functions are implemented as methods of the pandas data structures rather than as functions of the main library.

In [29]:
import pandas

data = pandas.Series([1, 3, 1, 1, 6, 2, 8, 2, 9, 22, 13, 7, 23], name = "data")

data_mean = data.mean()
print("The mean is:", data_mean)

data_median = data.median()
print("The median is:", data_median)

# mode() returns a Series because ties are possible; item() assumes exactly one mode
data_mode = data.mode().item()
print("The mode is:", data_mode)

data_std = data.std()
print("The standard deviation is:", data_std)

data_var = data.var()
print("The variance is:", data_var)

The mean is: 7.538461538461538
The median is: 6.0
The mode is: 1
The standard deviation is: 7.6006072631883015
The variance is: 57.76923076923077
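The same methods also work on a dataframe, where they are applied column by column. The sketch below uses a small made-up frame named `scores` to show this. It also shows that `std()` and `var()` compute the sample statistic by default (`ddof=1`, dividing by n - 1), which is why the variance above is slightly larger than a population variance would be.

```python
import pandas

# Hypothetical data for illustration
scores = pandas.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
})

print(scores.mean())              # one mean per column
print(scores["a"].var())          # sample variance (ddof=1, the default)
print(scores["a"].var(ddof=0))    # population variance (divide by n)
```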


## Useful Functions and Features

The pandas library has an extensive list of functions and features that are useful across a wide range of data science tasks.  Some of the more notable ones include:

- ***read_csv()*** - reads a data file in .csv format into a data frame
- ***head()*** - shows the first few rows (five by default) of a series or data frame so you can preview the data
- ***describe()*** - generates descriptive statistics for the columns of your data frame (numeric columns by default)
- ***memory_usage()*** - reports the memory usage of each data frame column in bytes
- ***astype()*** - casts a series or data frame column to a specified data type
- ***loc\[:\]*** - selects rows and columns of a data frame by label or boolean mask
- ***to_datetime()*** - converts a value, list, or series (for example, date strings) to pandas datetime format
- ***value_counts()*** - returns the count of each unique value
- ***drop_duplicates()*** - removes duplicate rows, keeping the first occurrence by default
- ***groupby()*** - groups a data frame by one or more columns so that each group can be aggregated
- ***merge()*** - joins two data frames on one or more shared columns or indexes
- ***sort_values()*** - sorts the rows of a data frame by the values in one or more columns
- ***fillna()*** - fills in missing values within a dataset
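Several of the functions above are often used together. The following is a minimal sketch, not a recommended pipeline: the CSV text, the column names (`city`, `date`, `sales`), and the `regions` lookup table are all invented for illustration, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas

# Hypothetical CSV content; read_csv() usually takes a file path instead
csv_text = """city,date,sales
Austin,2023-01-05,100
Boston,2023-01-06,
Austin,2023-01-07,150
Austin,2023-01-07,150
"""

sales = pandas.read_csv(io.StringIO(csv_text))
print(sales.head(2))                        # preview the first two rows

sales = sales.drop_duplicates()             # drop the repeated Austin row
sales["sales"] = sales["sales"].fillna(0)   # replace the missing value with 0
sales["sales"] = sales["sales"].astype("int64")
sales["date"] = pandas.to_datetime(sales["date"])

print(sales["city"].value_counts())         # number of rows per city

# Total sales per city, largest first
totals = (
    sales.groupby("city", as_index=False)["sales"]
    .sum()
    .sort_values("sales", ascending=False)
)

# Hypothetical lookup table, joined on the shared "city" column
regions = pandas.DataFrame({
    "city": ["Austin", "Boston"],
    "region": ["South", "Northeast"],
})
report = totals.merge(regions, on="city")

# Label-based selection: rows with sales > 0, two chosen columns
print(report.loc[report["sales"] > 0, ["city", "region"]])
```

The missing value is filled before `astype("int64")` is called because a column containing missing values cannot be cast to a plain integer type.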