# Short intro to NumPy

We start with NumPy because Pandas is built on top of it.

NumPy is a powerful library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with an extensive collection of mathematical functions to operate on these arrays efficiently.
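
As a minimal sketch of what "multi-dimensional arrays" and "mathematical functions" look like in practice, the example below builds a small 2-D array and aggregates it along each axis. The values and the variable names (`grid`, `col_sums`, `row_means`) are made up for illustration.

```python
import numpy as np

# A 2-D array with 2 rows and 3 columns (illustrative values)
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)              # (rows, columns)

col_sums = grid.sum(axis=0)    # add up the values down each column
row_means = grid.mean(axis=1)  # average the values across each row

print(col_sums)
print(row_means)
```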

In [None]:
# To use a library:
#
# 1-The library must be installed on our computer.
#
# 2-Import the library or some functions from the library.
#
# import library_name
# import library_name as alias
# from library import function, class,...
#
import numpy as np

# Creating a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Performing operations on the array
result = arr * 2
print(result)  # Output: [ 2  4  6  8 10]


In [None]:
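# other ways to do it with plain Python lists

# loops (this modifies my_list in place, so it becomes [2, 4, 6, 8, 10])
my_list = [1,2,3,4,5]
for i in range(len(my_list)):
    my_list[i] = my_list[i]*2

# map (returns a new list; my_list was already doubled by the loop above)
list(map(lambda x: x*2, my_list))

# list comprehension (also returns a new list)
[x*2 for x in my_list]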

In [None]:
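# how to do this for a dict

my_dict = {"a":1, "b":2, "c":3}

# option 1: loop over the keys (modifies my_dict in place)
for key in my_dict.keys():
    my_dict[key] = my_dict[key] * 2

# option 2: loop over the (key, value) pairs (doubles the values a second time)
for key, value in my_dict.items():
    my_dict[key] = value * 2

# option 3: dict comprehension (returns a new dict, my_dict is not modified)
{key: value*2 for key, value in my_dict.items()}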

# Introduction to Pandas Library

`pandas` is a powerful library designed for working with tabular, labeled data, making it ideal for handling spreadsheets, SQL tables, and more. It revolves around two main data structures: `Series` and `DataFrame`.

## Key Features of Pandas

- Robust data import/export: `pandas` provides efficient tools to read and write data from various file formats like CSV, XLS, SQL, and HDF5.
- Data manipulation: With `pandas`, you can easily filter, add, or delete data, enabling seamless data processing.
- High-performance and versatility: It combines the performance of `numpy` arrays with the ability to handle tabulated data efficiently.

To import the necessary modules from the `pandas` library, we use the following syntax:

In [None]:
import pandas as pd  # 'pd' is an alias for pandas

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/pandas.svg?raw=true)

## DataFrames and Series in Pandas

In `pandas`, a `Series` is a one-dimensional array of data with associated labels called the *index*. If no index is specified, it generates an ordered sequence of integers.

```python
# Creating a Series
s = pd.Series(data, index=index)
```
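
The snippet above is only a template. A concrete version might look like the following sketch, where `data` and `index` are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical data and labels, chosen only for illustration
data = [10, 20, 30]
index = ["a", "b", "c"]

s_default = pd.Series(data)               # no index given: labels are 0, 1, 2
s_labeled = pd.Series(data, index=index)  # explicit labels "a", "b", "c"

print(list(s_default.index))
print(list(s_labeled.index))
print(s_labeled["b"])                     # look up a value by its label
```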

On the other hand, a `DataFrame` is a two-dimensional data structure that stores data in tabular form, with labeled rows and columns. Each row represents an observation, and each column represents a variable. `DataFrame` can handle heterogeneous data with different types (numeric, string, boolean, etc.). It also includes variable names, types, and methods to access and modify the data.

```python
# Creating a DataFrame
df = pd.DataFrame(data, ...)
```
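
As a concrete sketch of the template above, the following builds a small `DataFrame` from a dictionary of lists. The column names and values are invented for illustration; each column ends up with its own data type.

```python
import pandas as pd

# Hypothetical data: each key becomes a column, each list holds that column's values
data = {
    "name": ["Ana", "Luis", "Mei"],
    "age": [34, 27, 41],
    "member": [True, False, True],
}

people = pd.DataFrame(data)

print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # one data type per column
```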


---
# Series in Pandas

## Creating Series


### From a list

Create a Series with default indexes from a list

In [None]:
l = [1980, 2020, 2001, 1999]
series = pd.Series(l)
series #we can see dtype int which means the series has integers

In [None]:
# To know the version of a library
print(f"The version of numpy is: {np.__version__}")
print(f"The version of pandas is: {pd.__version__}")

In [None]:
type(series) #but careful, it's still a pandas series

In [None]:
series.dtype #this gives me the type of the elements inside the series
# this is an attribute (remember our lesson about classes)

A `Series` has two main attributes: `values` and `index`. The first is a NumPy array that stores the data, and the second is an object that contains the index labels.

In [None]:
series.values

In [None]:
series.index

In Pandas, `Series.items()` is a method used to iterate over the elements of a Series. It returns an iterator that yields `(index label, value)` pairs. To iterate over the labels only, use `Series.keys()` or the `index` attribute.

In [None]:
for key in series.keys():
    print(key)

In [None]:
# iterating over the index gives only the labels
for i in series.index:
    print(i)

# items() yields (index, value) tuples, so unpack them into two variables
# to get the label and the value separately; with a single variable,
# each iteration gives the whole tuple
#for key, value in series.items():
#    print(key, value)

In [None]:
for i, v in series.items() :
    print("index ", i)
    print("value ", v)

### From a list with index

When creating a `Series`, you can explicitly define the index by passing a list or array of labels as the `index` argument.


Creating series with defined indexes

In [None]:
# create a new series with same values as before, and as indexes, names
# values [1980, 2020, 2001, 1999]
# index "deb","d","martha", "martin"
new_series = pd.Series([1980, 2020, 2001, 1999],
                       index = ["deb","d","martha", "martin"])
new_series

### From a dictionary

In [None]:
# using dict comprehension, create a dictionary that has as key "square of {nr}" and as value the nr squared
# for values from 0 to 10
d = {"square of "+str(n): n*n for n in range(11)}
d

In [None]:
pd.Series({str(n): n for n in range(11)})

In [None]:
pd.Series(d) # this creates a series from the dictionary 'd'

### From a file

`read_csv()` is a Pandas function used to read data from a CSV file and create a DataFrame.

When `usecols` selects a single column and we then call the method `squeeze("columns")`, the result is a `Series` instead of a `DataFrame`.
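
To see the effect of `squeeze("columns")` without downloading anything, here is a minimal sketch that reads a tiny CSV from an in-memory string. The CSV content is made up, and `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# A tiny, made-up CSV held in memory instead of a file on disk
csv_text = "Name,Age\nAlice,30\nBob,25\n"

# usecols keeps only the "Name" column, but the result is still a DataFrame
frame = pd.read_csv(io.StringIO(csv_text), usecols=["Name"])

# squeeze("columns") collapses the single-column DataFrame into a Series
names = frame.squeeze("columns")

print(type(frame).__name__)
print(type(names).__name__)
```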

In [None]:
# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
titanic_series = pd.read_csv(url, usecols=["Name"]).squeeze("columns")
titanic_series



---
## Data access in Series


Data in a `Series` can be accessed through either the index labels or the integer position of each element.
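
One way to make the choice between labels and positions explicit is to use `loc` (labels) and `iloc` (positions). The sketch below rebuilds the same values as `new_series` under a different name (`years`) so that it runs on its own.

```python
import pandas as pd

years = pd.Series([1980, 2020, 2001, 1999],
                  index=["deb", "d", "martha", "martin"])

by_label = years.loc["d"]      # look up by index label
by_position = years.iloc[1]    # look up by integer position (0-based)

# Label slices include both endpoints; position slices exclude the stop
label_slice = years.loc["d":"martha"]
position_slice = years.iloc[1:3]

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```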

In [None]:
new_series

In [None]:
new_series.iloc[1] #accessing elements by integer position (plain new_series[1] is deprecated when the index holds labels)

In [None]:
new_series["d"] #accessing elements by index label

In [None]:
new_series[1:]  #we can do the same as with lists, and use slicing

In [None]:
new_series[1:3:2] # start:stop:step -> from position 1 up to (not including) 3, in steps of 2

In [None]:
new_series[::-1]



---
## Methods in Series

### `concat()`

`concat()` is a pandas function used to concatenate Series or DataFrames along a specified axis, either vertically (rows) or horizontally (columns).

In [None]:
s3 = pd.concat([new_series,new_series])
s3

By default `concat` keeps the original indexes. It does not restart the index by default, unless we specify `ignore_index=True`.
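
The difference is easy to see on two tiny Series. This is a minimal sketch with made-up values and hypothetical names (`a`, `b`, `kept`, `renumbered`).

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([3, 4], index=["x", "y"])

kept = pd.concat([a, b])                          # original labels are kept
renumbered = pd.concat([a, b], ignore_index=True)  # labels restart at 0

print(list(kept.index))
print(list(renumbered.index))
print(kept["x"].tolist())  # a repeated label returns every matching value
```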

In [None]:
s3["d"] # we see we have two elements for index "d"

In [None]:
s3.iloc[1] # element at integer position 1

In [None]:
s3 = pd.concat([new_series,new_series],ignore_index=True)
s3

In [None]:
pd.DataFrame(pd.concat([new_series,new_series]))

In [None]:
pd.DataFrame(pd.concat([new_series,new_series])).reset_index()

In [None]:
pd.concat([new_series,new_series], axis=1)

### `sort_values()` and `sort_index()`

`sort_values()` is a method that sorts a Series by its values (or a DataFrame by the specified column(s)), while `sort_index()` sorts by the index labels.

In [None]:
#titanic_series
titanic_series.sort_values()

In [None]:
titanic_series.sort_values(ascending=False) #to change the ordering type

Neither `sort_index()` nor `sort_values()` changes the original series; both return a sorted copy.
To keep the result, we can either assign it back to the variable or use the parameter **inplace**.
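
A small sketch of this behaviour, using a made-up three-element Series named `s`:

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["c", "a", "b"])

by_value = s.sort_values()  # new Series ordered by value
by_label = s.sort_index()   # new Series ordered by index label

print(by_value.tolist())
print(by_label.index.tolist())
print(s.tolist())           # the original Series is left untouched
```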

In [None]:
titanic_series # we see our last line of code didn't change our series

In [None]:
titanic_series.sort_index(ascending=False, inplace=True)

# equivalent: titanic_series = titanic_series.sort_index(ascending=False)

In [None]:
titanic_series # the series has changed because we used 'inplace=True'

### `value_counts()`

`value_counts()` is a Pandas method that returns a Series containing the count of each unique value in a Series (or of each unique row in a DataFrame).
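
As a minimal sketch on made-up data, the example below counts the values of a small Series named `ports`. It also uses the `normalize=True` option, which returns proportions instead of raw counts.

```python
import pandas as pd

# Made-up values, loosely imitating port codes
ports = pd.Series(["S", "C", "S", "Q", "S", "C"])

counts = ports.value_counts()                # most frequent value comes first
shares = ports.value_counts(normalize=True)  # fractions that add up to 1

print(counts)
print(shares)
```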

In [None]:
pd.read_csv(url)['Sex'].value_counts()

In [None]:
pd.read_csv(url)['Sex'].unique()

In [None]:
titanic_series.value_counts()

## 💡 Check for understanding

1. Get the column "Embarked" from the Titanic csv as a Pandas Series
2. Print the first value
3. Print the last 5 values
4. Append "NA" to the Series
5. Get the number of each Embarked type (number of repeated values)
6. Order the Series *descending*, and print the Embarked type most repeated in the Series

In [None]:
# Your answer here
# 1.
pd.read_csv(url)["Embarked"]

In [None]:
# 2.
pd.read_csv(url)["Embarked"][0]

In [None]:
# 3.
pd.read_csv(url)["Embarked"][-5:]

In [None]:
# 4
pd.Series(list(pd.read_csv(url)['Embarked'].values) + ["NA"])

In [None]:
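# 5.
pd.read_csv(url)["Embarked"].value_counts()

In [None]:
# 6.
# count each type, order the counts descending, and print the most repeated type
embarked_counts = pd.read_csv(url)["Embarked"].value_counts().sort_values(ascending=False)
print(embarked_counts.index[0])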


# Dataframes in Pandas

Unlike `Series`, `DataFrame` is designed to store heterogeneous multivariable data. It can be created from dictionaries (see the Extra section for examples).

We will use `read_csv` to read data from a CSV file and create a DataFrame.

This time we won't use the `usecols` parameter or call the `squeeze("columns")` method.

In [None]:
# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
titanic_df = pd.read_csv(url)

### `head()` and `tail()`

The `head()` method returns the first few rows of a DataFrame or Series, while the `tail()` method returns the last few rows.

In [None]:
titanic_df.head(7)

In [None]:
titanic_df.tail(3)

### `index`, `columns` and `values`

The `index` attribute in Pandas returns the index labels of a DataFrame or Series, and the `columns` attribute returns the column labels of a DataFrame.

In [None]:
titanic_df.index

An output such as `RangeIndex(start=0, stop=1460, step=1)` shows a special type of index in Pandas called a `RangeIndex`. It represents a range of integer values starting from 0, up to (but not including) the `stop` value (here 1460), with a step size of 1.

In [None]:
list(titanic_df.index)

In [None]:
titanic_df.columns

In [None]:
list(titanic_df.columns)

In Pandas, `df.values` is an attribute that returns the underlying NumPy array of a DataFrame df. It represents the actual data stored in the DataFrame as a two-dimensional NumPy array.
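
A minimal sketch on a made-up DataFrame (`scores`) shows the shape of the resulting array. It also uses `to_numpy()`, the method that the pandas documentation recommends over `values`; here both give the same result.

```python
import pandas as pd

# Made-up scores: 2 rows, 2 columns
scores = pd.DataFrame({"math": [90, 75], "art": [60, 88]})

as_array = scores.to_numpy()  # 2-D NumPy array, one inner row per DataFrame row

print(as_array.shape)
print(as_array.tolist())
print((scores.values == as_array).all())  # same content as the values attribute
```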

In [None]:
titanic_df.head(5)

In [None]:
titanic_df.values


Additionally, Pandas can create DataFrames from other sources, such as JSON files or URLs.
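
For instance, `pd.read_json` can build a DataFrame from JSON. The sketch below reads a made-up JSON string from memory (through `io.StringIO`) rather than from a real file or URL.

```python
import io
import pandas as pd

# Made-up JSON: a list of records, one per row
json_text = '[{"name": "Paula", "score": 98.5}, {"name": "Mark", "score": 95}]'

df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_json)
```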


---
## Data access in Dataframe


### Columns

You can extract columns from a `DataFrame` using dictionary-like notation or as attributes, obtaining a `Series` object in both cases. Attribute access only works when the column label is a valid Python identifier.
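
The sketch below illustrates both notations on a made-up DataFrame (`students`). The column `"first name"` contains a space, so it can only be reached with brackets.

```python
import pandas as pd

students = pd.DataFrame({
    "score": [98.5, 95.0],
    "first name": ["Paula", "Mark"],  # the space rules out attribute access
})

by_key = students["score"]       # dictionary-like notation
by_attr = students.score         # attribute notation, same Series
spaced = students["first name"]  # brackets work for any label

print(by_key.equals(by_attr))
print(spaced.tolist())
```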

In [None]:
titanic_df["Sex"] #dict way of accessing a column

In [None]:
titanic_df.Sex #attribute access of the column

#this only works if the column name doesn't have unusual characters, this means
#no spaces!!!

In [None]:
titanic_df[["Sex"]] #by using double [] we get a Dataframe instead
# of a series

In [None]:
titanic_df[["Sex","Fare"]]

### Rows

To access rows in a Pandas DataFrame, you can use the `iloc` attribute with integer-based indexing or the `loc` attribute with label-based indexing. For example, `df.iloc[0]` will access the first row, and `df.loc['row_label']` will access the row with the specified label.


<div class="alert alert-info">Note: Consult http://pandas.pydata.org/pandas-docs/version/0.18.1/indexing.html#different-choices-for-indexing to understand the differences between methods.</div>

Just for the sake of this example, we will create a DataFrame from a dictionary, with a non-default index.

In [None]:
d_index = {
    "name": ["Paula", "Mark"],
    "score": [98.5, 95]
}

df_index = pd.DataFrame(d_index, index=["123A", "789B"])
df_index


**loc** is used for label-based indexing to access rows.

In [None]:
df_index.loc["789B"]


**iloc** is used for integer-based indexing in Pandas to access rows

In [None]:
df_index.iloc[1]

### Specific values

Here's a summary of the different ways to access individual values in a Pandas DataFrame:

1. `df.loc[row_label][column_label]`: Chooses the row with the label and then the value in that row with the column label.

2. `df.iloc[row_position][column_label]`: Selects the row with the position and then the value in that row with the column label.

3. `df.loc[row_label, column_label]`: Directly accesses the value using both row and column labels.

4. `df.iloc[row_position, column_position]`: Directly accesses the value using both row and column positions.

Note: you might see chained indexing such as `df[column_label][row_label]`, even though it is less recommended than using `loc`.
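
The four options listed above can be compared side by side. This is a minimal sketch on a small DataFrame with the same shape as the example one, rebuilt here under the hypothetical name `df_demo` so the snippet runs on its own.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"name": ["Paula", "Mark"], "score": [98.5, 95.0]},
    index=["123A", "789B"],
)

v1 = df_demo.loc["789B"]["score"]  # row by label, then value by column label
v2 = df_demo.iloc[1]["score"]      # row by position, then value by column label
v3 = df_demo.loc["789B", "score"]  # row label and column label in one step
v4 = df_demo.iloc[1, 1]            # row position and column position in one step

print(v1, v2, v3, v4)
```

All four return the same value here. The single-step forms (`loc[row, column]` and `iloc[row, column]`) are generally the safer habit, especially when assigning values.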

In [None]:
# first I access the row with the index, then I access the value
# with the index of the resulting series (name of the column)
df_index.iloc[0]["score"]

In [None]:
# first I access the row with the label, then I access the value
# with the index of the resulting series (name of the column)
df_index.loc["789B"]["score"]

In [None]:
df_index.loc["789B","score"]


---
## More DataFrame Methods


Let's see some useful methods of the Dataframe class.

### `shape`

`shape` is a Pandas attribute that returns a tuple representing the dimensions of a DataFrame or Series. For a DataFrame, the tuple holds the number of rows and columns. Because `shape` is an attribute and not a method, calling `shape()` raises a `TypeError` (a tuple is not callable).
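
A short sketch on a made-up DataFrame (`table`) shows the tuple for a DataFrame and for a Series, and prints the name of the exception that is actually raised when `shape` is called with parentheses.

```python
import pandas as pd

table = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(table.shape)       # rows and columns of the DataFrame
print(table["a"].shape)  # a Series has a single dimension

# shape is a plain tuple, so adding parentheses tries to call a tuple
error_name = None
try:
    table.shape()
except Exception as err:
    error_name = type(err).__name__

print(error_name)
```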

In [None]:
titanic_df.shape # gives a tuple (nr rows, nr columns)

In [None]:
titanic_df.shape[1] # tuples are accessed like this. nr of columns

In [None]:
titanic_df.shape[0] # nr of rows

### `describe()`

`describe()` is a Pandas method that generates descriptive statistics of a DataFrame, providing information on count, mean, standard deviation, minimum, maximum, and quartiles for each numerical column.

In [None]:
titanic_df.describe()

In [None]:
titanic_df.describe(include="object")

In [None]:
titanic_df.describe(include="all")

### `info()`

`info()` is a Pandas method that provides a concise summary of a DataFrame, including information about the data types, non-null values, and memory usage.

In [None]:
titanic_df.info()

### `nunique()` and `unique()`

`nunique()` is a Pandas method that returns the number of unique elements in a Series (or per column in a DataFrame), while `unique()` is a Series method that returns an array of its unique elements.
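
One detail worth knowing is how missing values are treated. In the sketch below (made-up data in a Series named `ports`), `nunique()` skips the missing value by default, while `unique()` includes it.

```python
import pandas as pd

# Made-up values with one missing entry
ports = pd.Series(["S", "C", None, "S"])

n_distinct = ports.nunique()                  # missing values are not counted
n_with_missing = ports.nunique(dropna=False)  # count the missing value too
distinct_values = ports.unique()              # array that includes the missing value

print(n_distinct, n_with_missing)
print(distinct_values)
```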

In [None]:
titanic_df.nunique() #number of unique values per column

In [None]:
#see unique values of one column
titanic_df.Pclass.unique()

### `dtypes`

`dtypes` is a Pandas attribute that returns the data type of each column in a DataFrame, while `dtype` is an attribute that returns the data type of the elements of a single Series.
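
A minimal sketch of the two, on a made-up DataFrame named `df_types`:

```python
import pandas as pd

df_types = pd.DataFrame({"name": ["Paula", "Mark"], "score": [98.5, 95.0]})

all_types = df_types.dtypes           # a Series: one data type per column
score_type = df_types["score"].dtype  # the data type of a single column

print(all_types)
print(score_type)
```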

In [None]:
titanic_df.dtypes

### `select_dtypes()`

`select_dtypes()` is a pandas DataFrame method used to filter columns based on their data types. It allows you to select numeric, object (string), boolean, datetime, or categorical columns.

Syntax:
```python
DataFrame.select_dtypes(include=None, exclude=None)
```

- `include`: A list of data types or strings representing data types to include in the selection. If specified, only columns with these data types will be included.
- `exclude`: A list of data types or strings representing data types to exclude from the selection. If specified, columns with these data types will be excluded.

Example:
```python
# Assuming df is a DataFrame
numeric_columns = df.select_dtypes(include='number')
object_columns = df.select_dtypes(include='object')
```
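
The example above assumes an existing `df`. A self-contained sketch, using a made-up DataFrame named `passengers`, shows `include` and `exclude` working as complements:

```python
import pandas as pd

passengers = pd.DataFrame({
    "name": ["Ana", "Luis"],
    "age": [34, 27],
    "fare": [7.25, 71.28],
})

numeric = passengers.select_dtypes(include="number")      # integer and float columns
non_numeric = passengers.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))
print(list(non_numeric.columns))
```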



In [None]:
titanic_df.select_dtypes(include=['int','float'])

In [None]:
titanic_df.select_dtypes(include='number').head()

### Aggregation such as `max()`

In [None]:
titanic_df.select_dtypes(include='number').max() #get the max for each numerical column in the df

In [None]:
titanic_df["Fare"].max() #get the max for one numerical column

Just like `max()`, there are many methods that can be applied to either the entire dataframe or its individual columns.
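
For example, `mean()`, `sum()` and `min()` follow the same pattern. The sketch below uses a small made-up DataFrame named `fares`.

```python
import pandas as pd

fares = pd.DataFrame({"fare": [7.25, 71.25, 8.0], "age": [22, 38, 26]})

col_means = fares.mean()         # one mean per column, returned as a Series
total_fare = fares["fare"].sum()  # a single number for one column
youngest = fares["age"].min()

print(col_means)
print(total_fare, youngest)
```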

## 💡 Check for understanding

- a. Use the original titanic_df
- b. Select the `Sex` and `Fare` columns.
- c. Indicate how many different types of `Sex` there are.
- d. Indicate how many `Sex` of each type there are.
- e. Show a statistical summary of all the variables.
- f. Write some conclusions

In [None]:
# Your code here
# b.
titanic_df[["Sex","Fare"]]

In [None]:
# c.
titanic_df['Sex'].nunique()

In [None]:
# d.
titanic_df['Sex'].value_counts()

In [None]:
# e.
titanic_df.describe(include='all')

# Summary

- Pandas is a library, built on top of NumPy, designed for working with tabular, labeled data, making it ideal for handling spreadsheets, SQL tables, and more.
- DataFrames and Series are the two main data structures in Pandas.
- Series is a one-dimensional array of data with associated labels called the index, while DataFrame is a two-dimensional tabular data structure with labeled rows and columns.
- Data access in Series and DataFrame can be achieved using integer-based indexing (iloc), label-based indexing (loc), or dictionary-like notation for column access.
- Series and DataFrame have various methods and attributes, such as sort_values(), sort_index(), value_counts(), describe(), info(), nunique(), unique(), dtypes, and select_dtypes().


# Extra: Creating Dataframes from a Dictionary

In [None]:
# Create a Dataframe from a dictionary with
# automatic indexes

d = {"state": ["Ohio", "Ohio", "California", "Nevada", "California"],
     "year": [2000, 2001, 2002, 2001, 2002],
     "avg": [1.5, 1.7, 3.6, 2.4, 1.9]
}

df = pd.DataFrame(d)
df


DataFrame from a dictionary of lists and indexes

In [None]:
d_index = {
    "name": ["Paula", "Mark"],
    "score": [98.5, 95]
}

df_index = pd.DataFrame(d_index, index=["123A", "789B"])
df_index

# Extra: pickle

Pandas provides the `to_pickle()` method and the `read_pickle()` function as a convenient way to save and load pandas objects, such as DataFrames, to and from disk. Pickling serializes Python objects into a binary format, making it easy to store large datasets or complex data structures. It's a great tool for saving and restoring your work, especially when dealing with large datasets that might take a long time to process or recreate.

## Saving DataFrames with Pickle

You can use the `to_pickle()` method in pandas to save a DataFrame to a pickle file. This method takes the file path as an argument and creates a binary representation of the DataFrame, which is then saved to the specified file.

In [None]:
import pandas as pd

# Assuming df is your DataFrame
df.to_pickle('data.pkl') # data.pkl / data.pickle

## Loading DataFrames from Pickle

To load a DataFrame from a pickle file, you can use the `read_pickle()` function in pandas. This function reads the binary data from the pickle file and converts it back into a DataFrame.


In [None]:
import pandas as pd

# Load DataFrame from pickle file
df = pd.read_pickle('data.pkl')
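
The save and load steps can be combined into a quick round-trip check. This is a minimal sketch: the DataFrame (`original`) is made up, and it writes to a temporary directory so that no file is left behind.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({"state": ["Ohio", "Nevada"], "avg": [1.5, 2.4]})

# Write to a temporary folder that is removed when the block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")
    original.to_pickle(path)          # save
    restored = pd.read_pickle(path)   # load

print(original.equals(restored))      # same data, index, and dtypes
```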