# Lab 05 - Pandas Library - Florentin Degbo

#### Pandas Library:
1. Pandas is another powerful Python library for manipulating data,
2. It is built on top of NumPy,
3. While NumPy mainly handles numeric dtypes, the Pandas library handles both string and numeric values,
4. There are two main data structures in Pandas: Series and DataFrames
     - Series are 1-D (one-dimensional) structures,
     - DataFrames (df) are 2-D (two-dimensional) structures: a 2-D structure is a tabular dataset (having rows and columns).
5. The most reliable source for Pandas is: https://pandas.pydata.org/
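
As a small illustration of points 2 and 3, the sketch below shows a Series handing back its values as a NumPy array, and a DataFrame holding a string column next to a numeric one. The names `nums` and `mixed` and their data are made up for this example and are not part of the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the lab)
nums = pd.Series([1.5, 2.5, 3.0])                  # numeric values only
mixed = pd.DataFrame({"name": ["Ann", "Bob"],      # string column
                      "score": [90, 85]})          # numeric column

# A Series can return its values as a NumPy array
arr = nums.to_numpy()
print(type(arr))       # <class 'numpy.ndarray'>
print(nums.dtype)      # float64

# Each DataFrame column keeps its own dtype
print(mixed.dtypes)
```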

### Installing and Importing Pandas Library

In [4]:
# let's install pandas
!pip install pandas



In [5]:
# let's import pandas
import numpy as np
import pandas as pd # pd is the alias for pandas

In [6]:
# let's check the version of pandas we are using here
print(pd.__version__)

2.2.2


### Data Structures in Pandas
1. Series: It is made up of two parts:
    - Index (label)
    - The values (1-D column)
2. DataFrames (df): It is made up of rows and columns
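
The two parts of a Series can be inspected separately. The following is a minimal sketch with made-up data (the Series `temps` and its city labels are hypothetical), showing the index on one side and the 1-D values on the other.

```python
import pandas as pd

# Hypothetical example data: three temperatures labelled by city
temps = pd.Series([21.5, 30.1, 18.0],
                  index=["Paris", "Cairo", "Oslo"], name="temp_c")

labels = list(temps.index)     # part 1: the index (labels)
values = temps.to_numpy()      # part 2: the values (1-D array)

print(labels)                  # ['Paris', 'Cairo', 'Oslo']
print(values.ndim)             # 1
print(temps["Cairo"])          # 30.1
```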

#### 1. Series

In [9]:
# 1. let's create a Pandas Series using a Python list
# pandas assigns a default integer index (0, 1, 2, ...); the index is not part of the data itself
s = pd.Series([67, 45, 23, 100, 105], name='Grades')
s

0     67
1     45
2     23
3    100
4    105
Name: Grades, dtype: int64

In [10]:
# let's see the type of s
type(s)

pandas.core.series.Series

In [11]:
# 2. let's create a Pandas Series using a Python dictionary
# the dictionary keys become the index, which is why the index does not start at 0
s1 = pd.Series({1:'swan', 2:'dove', 'Thanksgiving':'turkey', 4:'parrot', 'Degbo':'eagle'}, name='birds')
s1

1                 swan
2                 dove
Thanksgiving    turkey
4               parrot
Degbo            eagle
Name: birds, dtype: object

In [12]:
type(s1)

pandas.core.series.Series

In [13]:
# let's see the index of s1
s1.index

Index([1, 2, 'Thanksgiving', 4, 'Degbo'], dtype='object')

In [14]:
s1.ndim

1

### Indexing and Slicing Pandas Series

In [16]:
# let's have eagle as the output: indexing a series
s1['Degbo']

'eagle'

In [17]:
# slicing a series: display the first 3 elements of s
s[:3]

0    67
1    45
2    23
Name: Grades, dtype: int64
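
The square-bracket forms above work, but it can be clearer to say explicitly whether a label or a position is meant. The sketch below rebuilds `s` and `s1` from the cells above and uses `.loc` (label-based) and `.iloc` (position-based). The variable names `by_label`, `by_position`, `first_three`, and `label_slice` are introduced here for illustration only.

```python
import pandas as pd

s = pd.Series([67, 45, 23, 100, 105], name="Grades")
s1 = pd.Series({1: "swan", 2: "dove", "Thanksgiving": "turkey",
                4: "parrot", "Degbo": "eagle"}, name="birds")

# .loc selects by label, .iloc selects by integer position
by_label = s1.loc["Degbo"]       # 'eagle'
by_position = s1.iloc[0]         # 'swan' (first element, whose label is 1)

# Position slices exclude the stop value; label slices include it
first_three = s.iloc[:3]         # positions 0, 1, 2
label_slice = s.loc[0:3]         # labels 0, 1, 2 and 3

print(by_label, by_position)                 # eagle swan
print(len(first_three), len(label_slice))    # 3 4
```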

#### 2. DataFrames (df)
1. A Pandas DataFrame is a 2-D structure with rows and columns,
2. DataFrames are the most commonly used data structures in Pandas,
3. There are two axes in a DataFrame: axis=0 (rows), axis=1 (columns),
4. Python has no built-in tabular (2-D) data structure; a dictionary maps keys to values, and a list of dictionaries can represent the rows of a table,
5. Pandas DataFrames can be created from a collection of Python dictionaries.
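
The meaning of the two axes is easiest to see with an aggregation. Below is a minimal sketch on made-up data (the DataFrame `scores` and its values are hypothetical): `axis=0` collapses the rows and gives one result per column, while `axis=1` collapses the columns and gives one result per row.

```python
import pandas as pd

# Hypothetical example data: each dictionary is one row
rows = [{"math": 80, "art": 90},
        {"math": 70, "art": 60}]
scores = pd.DataFrame(rows, index=["Ann", "Bob"])

# axis=0 works down the rows: one total per column
col_totals = scores.sum(axis=0)    # math 150, art 150

# axis=1 works across the columns: one total per row
row_totals = scores.sum(axis=1)    # Ann 170, Bob 130

print(col_totals)
print(row_totals)
```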


In [19]:
# 1. creating df using Python dictionaries 
d = [{100:'swan', 2:'dove', 'Thanksgiving':'turkey', 4:'parrot', 'Degbo':'eagle'},
{1:'a', 2:'b', 3:'c', 4:'d', 5:'e'}]
d

[{100: 'swan',
  2: 'dove',
  'Thanksgiving': 'turkey',
  4: 'parrot',
  'Degbo': 'eagle'},
 {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}]

In [20]:
# let's convert the list d to a DataFrame named df_1
# NaN marks missing values (keys that are absent from a dictionary)
df_1 = pd.DataFrame(d)
df_1

    100     2  Thanksgiving       4  Degbo    1    3    5
0  swan  dove        turkey  parrot  eagle  NaN  NaN  NaN
1   NaN     b           NaN       d    NaN    a    c    e


In [21]:
type(df_1)

pandas.core.frame.DataFrame

#### Observations
1. A DataFrame is created from a list of dictionaries, with each dictionary representing a row and keys as column labels.
2. Pandas automatically fills missing data with NaN where applicable.
3. Columns are aligned based on their keys, and missing keys are filled with NaN.
4. The DataFrame is of type pandas.core.frame.DataFrame, which can handle both numeric and string data.
5. The DataFrame structure enables easy data manipulation, including row and column access.
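
Observations 2 and 3 can be checked directly. The sketch below rebuilds `df_1` from the cell above and counts the missing values, then shows two common ways of dealing with them. The placeholder string `"unknown"` and the names `missing_per_column`, `complete`, and `filled` are choices made for this example.

```python
import pandas as pd

d = [{100: "swan", 2: "dove", "Thanksgiving": "turkey", 4: "parrot", "Degbo": "eagle"},
     {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}]
df_1 = pd.DataFrame(d)

# Count missing values in each column
missing_per_column = df_1.isna().sum()
print(missing_per_column)

# Keep only the columns with no missing values
# (the keys that appear in both dictionaries)
complete = df_1.dropna(axis=1)
print(list(complete.columns))      # [2, 4]

# Or replace every missing value with a placeholder
filled = df_1.fillna("unknown")
print(filled.loc[1, 100])          # unknown
```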

In [23]:
# 2. creating a df using numpy arrays
array = np.array([[45,85000], [24,45000], [78,102000], [19,24000], [58,120000]])
array


array([[    45,  85000],
       [    24,  45000],
       [    78, 102000],
       [    19,  24000],
       [    58, 120000]])

In [24]:
# let's convert array into a pandas DataFrame (note: this will be on the final exam)
df_2 = pd.DataFrame(array, columns=['Age', 'Income'], 
                       index=['Sam', 'Mike', 'Lily', 'Sarah', 'Degbo'])
df_2

       Age  Income
Sam     45   85000
Mike    24   45000
Lily    78  102000
Sarah   19   24000
Degbo   58  120000


#### Observations
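1. A DataFrame can be created from a 2-D NumPy array, with each inner list of the array becoming one row.
2. The columns parameter assigns the column labels ('Age' and 'Income').
3. The index parameter replaces the default integer index with custom row labels (here, names).
4. The resulting DataFrame has 5 rows and 2 columns.
5. Both columns hold integer values, because the source array contains only integers.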
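
Once the rows and columns have labels, they can be used to pull out parts of the table. The following sketch rebuilds `df_2` from the cell above. The threshold of 80000 and the names `ages`, `lily`, and `high_income` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

array = np.array([[45, 85000], [24, 45000], [78, 102000], [19, 24000], [58, 120000]])
df_2 = pd.DataFrame(array, columns=["Age", "Income"],
                    index=["Sam", "Mike", "Lily", "Sarah", "Degbo"])

ages = df_2["Age"]                            # one column, returned as a Series
lily = df_2.loc["Lily"]                       # one row, selected by its label
high_income = df_2[df_2["Income"] > 80000]    # rows that satisfy a condition

print(lily["Income"])                 # 102000
print(list(high_income.index))        # ['Sam', 'Lily', 'Degbo']
```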

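### Data Attributes: the information that each attribute provides
- Data attributes provide more information about the dataset. For df_2, they are as follows:
- shape: (5, 2), the number of rows and columns
- ndim: 2, the number of dimensions (axes)
- size: 10, the total number of elements (5 rows times 2 columns)
- dtypes: the data type of each column; both 'Age' and 'Income' are integer columns (int64 on most platforms), not object
- columns: the column labels, 'Age' and 'Income'
- index: the row labels, 'Sam', 'Mike', 'Lily', 'Sarah', 'Degbo'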

In [27]:
#shape
df_2.shape

(5, 2)

In [28]:
# ndim
df_2.ndim

2

In [None]:
# size
df_2.size

In [None]:
# dtypes
df_2.dtypes

In [None]:
# columns
df_2.columns

In [None]:
# index
df_2.index

In [None]:
# let's load the movies dataset
df = pd.read_csv('movies.csv')

In [None]:
# making sure the dataset has been loaded
df.head(8)

In [None]:
df.tail()
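
Since `movies.csv` is not included with this write-up, the same loading and previewing steps can be tried on a small in-memory CSV. This is only a sketch: the column names and rows below are invented and do not reflect the real movies dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (invented rows)
csv_text = """title,year,rating
Movie A,1999,7.5
Movie B,2005,6.8
Movie C,2012,8.1
Movie D,2020,7.0
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)       # (4, 3)
print(sample.head(2))     # first two rows
print(sample.tail(1))     # last row
```

With no argument, `head()` and `tail()` return five rows, which is why `df.head(8)` above asks for eight explicitly.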

### Conclusion
1. Pandas is a powerful library for data manipulation in Python, especially for structured data like tabular datasets.
2. DataFrames can be created from various data sources such as dictionaries, lists, and CSV files.
3. Each DataFrame has an index that labels its rows and supports label-based data access; index labels are usually unique, although Pandas does not require them to be.
4. Missing data is represented as NaN, making it easy to identify and handle gaps in data.
5. Pandas can read data from different file formats, including CSV, Excel, and SQL databases.
6. DataFrames support various data types, such as integers, floats, and strings, within a single structure.
7. Data selection is easy with indexing, allowing users to filter and retrieve specific data subsets.
8. Missing values can be handled using methods like .fillna() or .dropna() to clean the data.
9. Data can be transformed using functions like .apply(), .map(), and .transform() for customization.
10. Pandas supports aggregation and grouping for generating summary statistics like sums and averages.
11. DataFrames can be merged using the .merge() function, allowing for combining multiple datasets.
12. Pandas relies on vectorized operations built on NumPy, which are typically much faster than explicit Python loops for datasets that fit in memory.
13. It integrates with Matplotlib for plotting and visualizing data directly from DataFrames.
14. Columns are aligned based on their keys, and missing keys are filled with NaN.
15. Methods like .head() and .tail() allow users to quickly preview the first and last rows of a dataset.
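
Points 9 to 11 were not exercised in the lab cells, so here is a minimal sketch of what they look like in practice. All data, column names, and variable names (`sales`, `stores`, `totals`, `merged`) are made up for illustration.

```python
import pandas as pd

# Hypothetical example data (not from the lab)
sales = pd.DataFrame({"store": ["A", "A", "B", "B"],
                      "amount": [10, 20, 5, 15]})
stores = pd.DataFrame({"store": ["A", "B"],
                       "city": ["Lagos", "Lima"]})

# Point 10: group rows and aggregate each group
totals = sales.groupby("store")["amount"].sum()     # A -> 30, B -> 20

# Point 11: combine two DataFrames on a shared column
merged = sales.merge(stores, on="store", how="left")

# Point 9: transform a column element by element with .map()
merged["amount_doubled"] = merged["amount"].map(lambda x: x * 2)

print(totals)
print(merged)
```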

##### End of Lab 05 