# Introduction to pandas
***pandas*** is a popular *Python library* used for data analysis and manipulation. It provides powerful data structures like **DataFrame** and **Series** that make working with structured data easy and intuitive.
## Setup
To download and install pandas, run the following in a terminal:

In [None]:
pip install pandas

To import pandas (the alias `pd` is the usual convention):

In [None]:
import pandas as pd

## Creating Data
To work with data, we first need data, and creating it ourselves is one way to get it. pandas has two core objects for this: the **DataFrame** and the **Series**.

### DataFrame
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It's similar to a spreadsheet or SQL table.

#### Creating DataFrame
##### From Dictionary

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
df
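Row labels can also be set when the DataFrame is created. The sketch below assumes a smaller version of the dictionary above; the labels `emp_a`, `emp_b`, and `emp_c` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical row labels, used only to illustrate the index parameter.
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
labeled = pd.DataFrame(data, index=['emp_a', 'emp_b', 'emp_c'])

print(labeled.index.tolist())        # the custom row labels
print(labeled.loc['emp_b', 'Age'])   # look up a value by row label and column name
```

Without the `index` argument, pandas numbers the rows 0, 1, 2, and so on.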

##### From a CSV file

In [None]:
# Forward slashes avoid the invalid "\H" escape sequence and work on every OS
df = pd.read_csv("archive/HR-Employee-Attrition.csv")
df

### Series
A Series is a one-dimensional labeled array capable of holding data of any type. If a DataFrame is a table, a Series is a list: each column of a DataFrame is a Series.

In [None]:
s = pd.Series([1, 2, 3, 4, 5])
s
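As a quick check of the relationship between the two objects, the sketch below pulls one column out of a small, made-up DataFrame (`df_small` is a hypothetical name) and inspects what comes back.

```python
import pandas as pd

# A tiny made-up DataFrame, used only for illustration.
df_small = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

age = df_small['Age']          # selecting one column returns a Series
print(type(age).__name__)      # Series
print(age.name)                # Age (the column name becomes the Series name)
```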

A Series is, in essence, a single column of a DataFrame. You can assign row labels to a Series with the `index` parameter, just as you can for a DataFrame. However, a Series does not have column names; it only has one overall `name`.

In [None]:
s = pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
s
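Once a Series has labels and a name, values can be looked up by label, and the name carries over as the column name if the Series is turned into a DataFrame. A minimal sketch, reusing the values above under the hypothetical variable name `sales`:

```python
import pandas as pd

sales = pd.Series([30, 35, 40],
                  index=['2015 Sales', '2016 Sales', '2017 Sales'],
                  name='Product A')

print(sales['2016 Sales'])     # 35 (lookup by row label)

# Converting to a DataFrame turns the Series name into the column name.
frame = sales.to_frame()
print(frame.columns.tolist())  # ['Product A']
```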

## Viewing data
Display the first few rows

In [None]:
df.head()

Display the last few rows

In [None]:
df.tail()

Display basic information about the DataFrame

In [None]:
df.info()

Summary statistics of numerical columns

In [None]:
df.describe()
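The viewing methods above can be tried on a small DataFrame where the results are easy to verify by eye. This is a sketch on made-up data (`people` is a hypothetical name); it also shows that `head` and `tail` accept a row count, and that `describe` summarizes only the numeric columns by default.

```python
import pandas as pd

# Made-up data, used only for illustration.
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age': [25, 30, 35],
                       'City': ['New York', 'San Francisco', 'Los Angeles']})

first_two = people.head(2)     # first 2 rows instead of the default 5
last_one = people.tail(1)      # last row only

summary = people.describe()    # text columns are left out by default
print(summary.columns.tolist())       # ['Age']
print(summary.loc['mean', 'Age'])     # 30.0
```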

## Indexing and Selection
***One thing to remember***: Both *loc* and *iloc* are row-first, column-second. This is the opposite of plain bracket indexing on a DataFrame, such as `df['Age'][0]`, which is column-first, row-second.
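The ordering is easiest to see when the same cell is reached three ways. The sketch below uses a small made-up DataFrame (`people` is a hypothetical name) with the default 0, 1, 2 row labels.

```python
import pandas as pd

# Made-up data, used only for illustration.
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age': [25, 30, 35]})

by_position = people.iloc[1, 0]      # row position 1, then column position 0
by_label = people.loc[1, 'Name']     # row label 1, then column label 'Name'
by_brackets = people['Name'][1]      # column first, then row label

print(by_position, by_label, by_brackets)   # Bob Bob Bob
```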

### Index-based selection
Selecting a column

In [None]:
df["Age"]

Selecting multiple columns

In [None]:
df[['Education', 'Age']]

Selecting rows by index

In [None]:
df.iloc[0]  # First row

In [None]:
df.iloc[1:3]  # Rows 2 to 3

In [None]:
df.iloc[1,3]  # Row 2, Column 4

In [None]:
df.iloc[:,0]  # All Rows, Column 1

In [None]:
df.iloc[:3,0]  # Row-[1,2,3], Column 1

In [None]:
df.iloc[1:3,0]  # Row-[2,3], Column 1

A list can be used too

In [None]:
df.iloc[[0,1,3], 0]  # Row-[1,2,4], Column 1

Negative indexing also works

In [None]:
df.iloc[-5:] # Last 5 Rows

### Label-based selection

In [None]:
df.loc[0, 'Age']  # Row with label 0 and 'Age' column

In [None]:
# All rows and the 'Age', 'Education' columns
df.loc[:, ['Age', 'Education']]

***Another thing to remember***: slices behave differently in the two indexers.
- `iloc[x:y]` selects positions x...(y-1) (y exclusive)
- `loc[x:y]` selects labels x...y (y inclusive)
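The difference between the two slicing rules can be checked directly. This sketch uses made-up data; `numbered` and `lettered` are hypothetical names for a DataFrame with the default integer labels and one with string labels.

```python
import pandas as pd

# Default integer row labels 0..3.
numbered = pd.DataFrame({'Age': [25, 30, 35, 40]})

print(len(numbered.iloc[0:2]))   # 2 rows: positions 0 and 1
print(len(numbered.loc[0:2]))    # 3 rows: labels 0, 1 and 2

# The inclusive rule matters most with non-numeric labels.
lettered = pd.DataFrame({'Age': [25, 30, 35, 40]}, index=['a', 'b', 'c', 'd'])
print(lettered.loc['a':'c'].index.tolist())   # ['a', 'b', 'c']
```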

## Manipulating the index

In [None]:
df.set_index("Title")

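`set_index` returns a new DataFrame and leaves `df` unchanged, so assign the result to a variable if you want to keep it. This is useful when one of the columns makes a better index for the dataset than the current one.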
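To see what `set_index` does to the original and to the result, here is a minimal sketch on made-up data; `people` and `by_name` are hypothetical names, and `Name` stands in for whichever column suits your dataset.

```python
import pandas as pd

# Made-up data, used only for illustration.
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age': [25, 30, 35]})

by_name = people.set_index('Name')   # returns a new DataFrame

print(people.index.tolist())         # [0, 1, 2] (original is untouched)
print(by_name.index.tolist())        # ['Alice', 'Bob', 'Charlie']
print(by_name.loc['Bob', 'Age'])     # 30 (rows are now addressed by name)
```

The column used as the index is moved out of the regular columns, so `by_name` has only the `Age` column left.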

## Conditional selection

In [None]:
# All Rows with Age == 21
df.loc[df.Age == 21]

In [None]:
# All rows with Age==21 and YearsAtCompany>2
df.loc[(df.Age==21) & (df.YearsAtCompany>2)]

In [None]:
# All rows with Age==21 or YearsAtCompany>2
df.loc[(df.Age==21) | (df.YearsAtCompany>2)]

In [None]:
# All rows where Age is not null (not missing)
df.loc[df.Age.notnull()]
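Each condition above is itself a Series of True/False values, one per row, and `loc` keeps the rows where it is True. The sketch below makes that intermediate step visible on made-up data (`people` and `mask` are hypothetical names).

```python
import pandas as pd

# Made-up data, used only for illustration.
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age': [25, 30, 35]})

mask = people.Age > 26           # a boolean Series, one value per row
print(mask.tolist())             # [False, True, True]

older = people.loc[mask]         # keeps only the rows where mask is True
print(older['Name'].tolist())    # ['Bob', 'Charlie']
```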

## Assigning Data

In [None]:
# Fill the Education column with B.Sc.
df['Education'] = 'B.Sc.'
df['Education']

In [None]:
# Fill Education with the descending sequence len(df), ..., 1
# (this overwrites the values assigned in the previous cell)
df['Education'] = range(len(df), 0, -1)
df['Education']
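The two assignments above follow different rules: a single value is repeated for every row, while a sequence must supply exactly one value per row. A minimal sketch on made-up data, where `people`, `Degree`, and `Rank` are hypothetical names:

```python
import pandas as pd

# Made-up data, used only for illustration.
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']})

people['Degree'] = 'B.Sc.'                    # one value, repeated for every row
people['Rank'] = range(len(people), 0, -1)    # one value per row: 3, 2, 1

# A sequence of the wrong length is rejected rather than padded or truncated.
try:
    people['Bad'] = [1, 2]                    # 2 values for 3 rows
    length_error = False
except ValueError:
    length_error = True

print(people['Rank'].tolist())   # [3, 2, 1]
print(length_error)              # True
```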

### Practice
Assume **reviews** is a DataFrame.
1. Select the first value from the description column of reviews, assigning it to variable first_description.

2. Select the first row of data (the first record) from reviews, assigning it to the variable first_row.

3. Select the first 10 values from the description column in reviews, assigning the result to variable first_descriptions.


4. Select the records with index labels 1, 2, 3, 5, and 8.

5. Create a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100.

6. Create a variable df containing the country and variety columns of the first 100 records.

7. Create a DataFrame italian_wines containing reviews of wines made in Italy.

8. Create a DataFrame top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.

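*Note: Parentheses around each condition are crucial, because `&` and `|` bind more tightly than comparison operators such as `==` and `>=`.*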
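To illustrate why the parentheses matter, the sketch below runs the same filter with and without them. It uses made-up data under the hypothetical name `staff`, with column names borrowed from the earlier examples, so it is not a solution to the exercises.

```python
import pandas as pd

# Made-up data, used only for illustration.
staff = pd.DataFrame({'Age': [21, 21, 40],
                      'YearsAtCompany': [1, 3, 10]})

# With parentheses, each comparison is evaluated before & combines them.
both = staff.loc[(staff.Age == 21) & (staff.YearsAtCompany > 2)]
print(both.index.tolist())   # [1]

# Without parentheses, Python evaluates 21 & staff.YearsAtCompany first and
# then chains the comparisons, which pandas rejects with a ValueError.
try:
    staff.loc[staff.Age == 21 & staff.YearsAtCompany > 2]
    failed = False
except ValueError:
    failed = True
print(failed)                # True
```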