## Import Packages and Libraries
These are the most important Python libraries for data analytics:

### Data Processing and Modeling

*   NumPy
*   Pandas

### Data Visualization

*   Matplotlib
*   Seaborn

**Note**❗: Import only the packages which will be used!

In [None]:
import numpy as np  # np is the conventional alias for NumPy
import pandas as pd  # pd is the conventional alias for pandas

import matplotlib.pyplot as plt  # plt is the conventional alias for matplotlib.pyplot
import seaborn as sns  # sns is the conventional alias for Seaborn

## Read the Dataset
How you import a dataset depends on the format of the file (Excel, CSV, text, SPSS, Stata, etc.). These are the most common ways to read a dataset:

**Notes**❗: 

*   CSV stands for comma-separated values.
*   Pay attention to the path of the dataset: a relative path is resolved against the current working directory.







In [None]:
# Use the reader that matches the file format
dataFrame = pd.read_csv('path/file.csv')     # CSV file
dataFrame = pd.read_excel('path/file.xlsx')  # Excel file
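
As a minimal sketch of `pd.read_csv`, the snippet below reads CSV text from an in-memory buffer instead of a file, so it runs without any data on disk. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = """name,age,city
Anna,28,Berlin
Ben,35,Hamburg
Cara,41,Munich
"""

# read_csv accepts a path or any file-like object.
people = pd.read_csv(io.StringIO(csv_text))

print(people)
print(people.dtypes)  # shows which type pandas inferred for each column
```

With a real file, the call would be `pd.read_csv("path/file.csv")`, as in the cell above.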

## Inspecting the Data Frame

The main reasons for inspecting the data are:

*   To have an idea of the size of the dataset.
*   To get the data type of each variable in the dataset.
*   To identify whether there are missing values in the dataset.

These are only some of the reasons.

**Note**❗: `shape` is an attribute, not a method like `head()`, `info()` and `describe()`, so it is written without parentheses.


In [None]:
# Print the head of the Data Frame
print(dataFrame.head())

# Print information about the Data Frame (info() prints directly and returns None)
dataFrame.info()

# Print the shape of the Data Frame
print(dataFrame.shape)

# Print a description of the Data Frame
print(dataFrame.describe())

dataFrame.shape   # Attribute: no parentheses
dataFrame.head()  # Method: called with parentheses
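
The difference between the attribute and the methods can be tried on a small frame. This is only a sketch; the `people` data is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
people = pd.DataFrame({
    "name": ["Anna", "Ben", "Cara", "Dan"],
    "age": [28, 35, None, 41],
})

n_rows, n_cols = people.shape  # attribute: (rows, columns), no parentheses
first_two = people.head(2)     # method: first two rows
summary = people.describe()    # method: statistics for the numeric columns

print(n_rows, n_cols)
print(first_two)
print(summary)
people.info()  # prints dtypes and non-null counts, which reveals the missing age
```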

## Sort the Rows in the Data Frame

The pandas `sort_values()` method sorts a data frame by the values of one or more columns, in ascending or descending order.

**Note**❗: `sort_values()` sorts in ascending order by default.

In [None]:
# One column
dataFrame.sort_values("col 1", ascending=True)

# Multiple columns
dataFrame.sort_values(["col 1", "col 2"], ascending=[True, False])
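
A small worked example of sorting by two columns in different directions, assuming a made-up `sales` table:

```python
import pandas as pd

# Hypothetical data for illustration only.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [200, 150, 300, 400],
})

# Regions A-Z, and within each region the highest revenue first.
sales_sorted = sales.sort_values(["region", "revenue"], ascending=[True, False])

print(sales_sorted)
```

`sort_values()` returns a new data frame; `sales` itself keeps its original row order.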

## Explicit Indexes

*   Columns and Index
*   Setting a column as the Index
*   Subsetting with Index
*   Sort Index


In [None]:
# Columns and Index
dataFrame.columns
dataFrame.index

# Setting a column as the Index
dataFrame_ind = dataFrame.set_index("col")

# Subsetting with Index
dataFrame[dataFrame["col"].isin(["row 1", "row 2"])] # First way: filter on the column
dataFrame_ind.loc[["row 1", "row 2"]] # Second way: look up index labels with .loc

# Sort Index
dataFrame_index = dataFrame.set_index(["col 1", "col 2"])
dataFrame_index.sort_index(level=["col 1", "col 2"], ascending=[True, False])
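
The index operations can be sketched end to end on a tiny frame. The `pets` data and its column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
pets = pd.DataFrame({
    "name": ["Rex", "Milo", "Luna"],
    "species": ["dog", "cat", "dog"],
    "weight_kg": [30, 4, 22],
})

# Move the "name" column into the index.
pets_ind = pets.set_index("name")

# Label-based lookup now goes through .loc on the indexed frame.
subset = pets_ind.loc[["Rex", "Luna"]]

# Sort the rows by their index labels.
pets_sorted = pets_ind.sort_index()

# reset_index() turns the index back into an ordinary column.
restored = pets_ind.reset_index()

print(subset)
print(pets_sorted)
```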

## Slicing and Subsetting

*   Slicing columns
*   Slicing by dates
*   Subsetting by row/column number
*   Subsetting with conditions
*   Subsetting rows by categorical variables
*   Adding a new column
*   Writing to CSV-File

**Notes**❗:

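*   The index of the Data Frame should be sorted (`sort_index()`) before slicing with `.loc`.
*   **i**loc: i for integer (position-based); `.loc` is label-based.
*   Logical operators in Pandas are `&` (and), `|` (or) and `~` (not).
*   The parentheses (...) around each condition are **important**❗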










In [None]:
# Slicing columns (dataFrame_sorted: a Data Frame whose index was sorted with sort_index())
dataFrame_sorted.loc[:, "col_index 1":"col_index 2"]

# Slicing by date
dataFrame_sorted.loc["yyyy-mm-dd":"yyyy-mm-dd"]

# Subsetting by row/column number
dataFrame.iloc[3:6, 1:5]

# Subsetting with conditions
SubSet = dataFrame[(dataFrame['col 1'] < 1000) & (dataFrame['col 2'] == 'value in col 2')]

# Subsetting rows by categorical variables
listName = ["value 1", "value 2", "value 3", "value 4"] # List of values in a column
SubSet = dataFrame[dataFrame["col"].isin(listName)]

# Adding a new column
dataFrame["new column"] = dataFrame["col 1 with integer values"] / dataFrame["col 2 with integer values"] 

# Writing to CSV-File
dataFrame.to_csv("dataFrame_with_new_column.csv")
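
The following sketch puts date slicing, position-based subsetting and a combined condition side by side. The `orders` table is hypothetical, and its dates are used as the index.

```python
import pandas as pd

# Hypothetical data for illustration only, indexed by order date.
orders = pd.DataFrame(
    {
        "product": ["pen", "book", "lamp", "desk"],
        "price": [2, 15, 40, 250],
        "category": ["office", "media", "home", "office"],
    },
    index=pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-15", "2023-04-20"]),
).sort_index()

# Label-based slicing on a sorted date index includes both endpoints.
q1 = orders.loc["2023-01-01":"2023-03-31"]

# Position-based slicing excludes the end position: rows 1-2, columns 0-1.
middle = orders.iloc[1:3, 0:2]

# Each condition needs its own parentheses because & binds tighter than < and ==.
cheap_office = orders[(orders["price"] < 100) & (orders["category"] == "office")]

print(q1)
print(middle)
print(cheap_office)
```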

## Visualizing the Data


*   Histograms
*   Barplots
*   Lineplots
*   Scatterplots
*   Boxplots **(Comparison)**

**Notes**❗:

* Adjust the number of bars, or bins, using the **"bins"** argument. Increasing or decreasing this can give us a better idea of what the distribution looks like.



In [None]:
# Histograms
dataFrame["col"].hist()
dataFrame["col"].hist(bins=30)

# Overlay the histograms of two groups to compare them
dataFrame[dataFrame["col 1"]=="value 1 in col 1"]["col 2"].hist(alpha=0.7)
dataFrame[dataFrame["col 1"]=="value 2 in col 1"]["col 2"].hist(alpha=0.7)
plt.legend(["value 1 in col 1", "value 2 in col 1"])


# Barplots
dataFrame.plot(kind="bar", title="Title of the Plot")

# Lineplots
dataFrame.plot(x="col 1", y="col 2", kind="line")

# Scatterplots
dataFrame.plot(x="col 1", y="col 2", kind="scatter")

sns.scatterplot(x='col 1', 
                y='col 2',
                hue='col 3', # This will be plotted in different colors.
                data=dataFrame)

plt.show() # To show the Plot

# Boxplot
sns.boxplot(x='col 1', 
            y='col 2',
            hue='col 3', # This will be plotted in different colors.
            data=dataFrame)
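
As a rough sketch of comparing two groups with overlaid histograms, the example below uses a made-up `heights` table. The `Agg` backend is selected only so the snippet runs without a display; in a notebook that line can be dropped and `plt.show()` called at the end.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
heights = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "height_cm": [150, 152, 155, 160, 161, 170, 172, 175, 178, 180],
})

fig, ax = plt.subplots()

# Draw one histogram per group on the same axes; alpha keeps overlaps visible.
for name, part in heights.groupby("group"):
    part["height_cm"].hist(bins=5, alpha=0.7, ax=ax, label=name)

ax.set_xlabel("height_cm")
ax.legend()
```

Changing `bins` changes how coarse or fine each distribution looks.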

## Missing Values

*   Detecting missing values
*   Detecting whether any value is missing (per column)
*   Counting missing values
*   Removing missing values
*   Replacing missing values



In [None]:
# Detecting missing values
dataFrame.isna() # Returns a Data Frame of True/False values

# Detecting any missing values
dataFrame.isna().any() # True for each column with at least one missing value

# Counting missing values
dataFrame.isna().sum()

# Removing missing values
dataFrame.dropna()

# Replacing missing values
dataFrame.fillna(0)
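
A short worked example of counting, dropping and filling missing values, assuming a hypothetical `scores` table. Filling with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
scores = pd.DataFrame({
    "student": ["Anna", "Ben", "Cara"],
    "math": [90.0, np.nan, 70.0],
    "art": [np.nan, np.nan, 85.0],
})

# Number of missing values in each column.
missing_per_column = scores.isna().sum()

# Keep only the rows without any missing value.
complete_rows = scores.dropna()

# Fill each column with its own replacement value.
filled = scores.fillna({"math": scores["math"].mean(), "art": 0})

print(missing_per_column)
print(complete_rows)
print(filled)
```

Both `dropna()` and `fillna()` return new data frames and leave `scores` unchanged.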

## Important Definitions

### KPI
KPI (Key Performance Indicator) is a type of performance measurement. KPIs evaluate the success of an organization or of a particular activity (such as projects, programs, products and other initiatives) in which it engages.

Depending on the project you are working on, a KPI could be:
*  number of new customers per month
*  total order value per day
*  Net Promoter Score (NPS)

And many more.
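
Two of the KPIs listed above can be sketched with `groupby`. The `orders` table, its column names, and the definition of a new customer (a customer whose first order falls in that month) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical data for illustration only.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2023-01-03", "2023-01-03", "2023-01-20", "2023-02-02"]
    ),
    "customer_id": [1, 2, 1, 3],
    "order_value": [50.0, 20.0, 30.0, 100.0],
})

# KPI: total order value per day.
value_per_day = orders.groupby("order_date")["order_value"].sum()

# KPI: new customers per month, counting each customer in the month of the first order.
first_order = orders.groupby("customer_id")["order_date"].min()
new_customers_per_month = (
    first_order.dt.to_period("M").value_counts().sort_index()
)

print(value_per_day)
print(new_customers_per_month)
```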

### Model Bucket

*   Change over time
*   **Comparison**
*   **Part of a whole**
*   **A Correlation**
*   **Ranking**
*   **Distribution**
*   Flows and relationships
*   Geospatial

### Numerical Data


*   Cumulative sum
*   Sum
*   Median
*   Minimum
*   Maximum
*   Standard deviation
*   Quantile
*   Mode
*   Count Values
*   Variance


In [None]:
# Median
dataFrame["col"].median()

# Mode
dataFrame["col"].mode()

# Minimum
dataFrame["col"].min()

# Maximum
dataFrame["col"].max()

# Variance (unbiased sample variance, ddof=1 by default)
dataFrame["col"].var()

# Standard deviation
dataFrame["col"].std()

# Sum
dataFrame["col"].sum()

# Cumulative sum
dataFrame["col"].cumsum()

# Quantile
dataFrame["col"].quantile(0.25) # q=0.5 (the median) by default

# Count Values
dataFrame["col"].value_counts(sort=True) # sort is True by default
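
To make the statistics concrete, here is a small sketch on a hand-checkable series. The values are arbitrary.

```python
import pandas as pd

# Arbitrary values chosen so the results are easy to verify by hand.
values = pd.Series([2, 4, 4, 6, 9], name="col")

median = values.median()                 # middle value of the sorted data
modes = values.mode().tolist()           # most frequent value(s)
sample_var = values.var()                # divides by n - 1 (ddof=1)
population_var = values.var(ddof=0)      # divides by n
running_total = values.cumsum().tolist() # cumulative sum
counts = values.value_counts()           # most frequent value comes first

print(median, modes, sample_var, population_var)
print(running_total)
print(counts)
```

The mean is 5 and the squared deviations add up to 28, so the sample variance is 28 / 4 = 7 and the population variance is 28 / 5 = 5.6.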