##### Pandas Series
A pandas Series is a one-dimensional array that can hold data of any type, much like a column in a table.

In [3]:
# Example
import pandas as pd
nums = [1,3,4,5,6]
example_series = pd.Series(nums)
print(example_series)

0    1
1    3
2    4
3    5
4    6
dtype: int64


##### Labels
If nothing is specified, the values are labeled with their index number, starting at 0. A label can be used to access the corresponding value. With the index argument, you can name your own labels.

In [6]:
# Creating labels
import pandas as pd
nums = [1,3,4,5,6]
example_series = pd.Series(nums, index = ['A','B','C','D','E'])
print(example_series)


A    1
B    3
C    4
D    5
E    6
dtype: int64
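
The text above says a label can be used to access a value. A minimal sketch of that, reusing the labeled Series from the cell above (the variable names value_b and value_d are only illustrative):

```python
import pandas as pd

nums = [1, 3, 4, 5, 6]
example_series = pd.Series(nums, index=['A', 'B', 'C', 'D', 'E'])

# Access a single value by its label
value_b = example_series['B']
print(value_b)

# .loc makes the label-based lookup explicit
value_d = example_series.loc['D']
print(value_d)
```

Both forms look the value up by label rather than by position.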


##### Key/Value Objects as Series
A key/value object, such as a dictionary, can be used to create a Series. The keys of the dictionary become the labels.

In [8]:
# Example
temp = { 'day1': 30, 'day2' : 35, 'day3' : 32}
temp_series = pd.Series(temp)
print(temp_series)

day1    30
day2    35
day3    32
dtype: int64


To select a subset of the dictionary, use the index argument and specify only the items you want to include in the Series.

In [9]:
temp = { 'day1': 30, 'day2' : 35, 'day3' : 32}
temp_series = pd.Series(temp, index = ['day2','day3'])
print(temp_series)
                        

day2    35
day3    32
dtype: int64


##### DataFrames
A pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [14]:
# Example
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

# load data into a DataFrame object
result = pd.DataFrame(data)

print(result)
    

   calories  duration
0       420        50
1       380        40
2       390        45


##### Locate Row
Pandas uses the loc attribute to return one or more specified rows.

In [15]:
# Example
print(result.loc[0])   # It returns a pandas series

calories    420
duration     50
Name: 0, dtype: int64


In [19]:
# To return more than one row, use a list of indexes
print(result.loc[[0,1]])

   calories  duration
0       420        50
1       380        40


##### Named Indexes

In [20]:
# with the index argument, you can name your own indexes
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

# load data into a DataFrame object
result = pd.DataFrame(data, index = ['day1','day2','day3'])

print(result)
    

      calories  duration
day1       420        50
day2       380        40
day3       390        45


##### Locate Named Indexes
Use the named index with the loc attribute to return the specified row(s).


In [21]:
# Example: return the row labeled 'day2' (the second row)
print(result.loc['day2'])

calories    380
duration     40
Name: day2, dtype: int64


##### Load Files Into a DataFrame
###### Read CSV Files

In [23]:
# Example 
import pandas as pd
df = pd.read_csv('data.csv')
print(df)                   # use to_string() method to print the entire DataFrame

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]


# With the to_string() method
print(df.to_string())       # prints the entire DataFrame

##### max_rows
The maximum number of rows displayed is defined in the pandas option settings. You can check it with pd.options.display.max_rows.

In [26]:
print(pd.options.display.max_rows)

60


- This means that if the DataFrame contains more than 60 rows, print(df) shows only the headers and the first and last 5 rows. You can change the maximum number of rows through the same option.

In [27]:
# Example
pd.options.display.max_rows = 9999
print(df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

##### Read JSON
- Note: JSON objects have a format similar to Python dictionaries. pandas reads JSON data with pd.read_json().
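
This section has no example cell, so here is a minimal sketch of reading JSON into a DataFrame. The inline JSON string is a made-up stand-in for a JSON file, which is an assumption; with a real file you would pass its path to pd.read_json() instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical inline JSON standing in for a JSON file on disk
json_text = '[{"Duration": 60, "Pulse": 110}, {"Duration": 45, "Pulse": 109}]'

# read_json accepts a file path or a file-like object
df_json = pd.read_json(StringIO(json_text))
print(df_json)
```

Each JSON object becomes one row, and its keys become the column names.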

##### Analyzing DataFrames
###### Viewing the Data
- The head() method returns the headers and a specified number of rows, starting from the top.
- If the number of rows is not specified, head() returns the top 5 rows.

In [2]:
# Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


- The tail() method returns the headers and a specified number of rows, starting from the bottom.
- If the number of rows is not specified, tail() returns the last 5 rows.

In [4]:
# Example
print(df.tail(10))

     Duration  Pulse  Maxpulse  Calories
159        30     80       120     240.9
160        30     85       120     250.4
161        45     90       130     260.4
162        45     95       130     270.0
163        45    100       140     280.9
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


##### Info About the Data
- The info() method gives more information about the data set, such as the column names, non-null counts, and data types.

In [5]:
# Example
print(df.info())   # info() prints its report and returns None, hence the trailing None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None


##### Data Cleaning
Data cleaning means fixing bad data in your data set, e.g. empty cells, data in the wrong format, wrong data, and duplicates.
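
Before fixing anything, it can help to measure how much bad data there is. The sketch below counts empty cells and duplicate rows on a small made-up DataFrame (df_check and its values are illustrative, not taken from data.csv).

```python
import numpy as np
import pandas as pd

# Small illustrative DataFrame with one empty cell and one duplicated row
df_check = pd.DataFrame({
    "Duration": [60, 60, 45, 60],
    "Calories": [409.1, 409.1, np.nan, 300.0],
})

# Count empty (NaN) cells per column
missing_per_column = df_check.isna().sum()
print(missing_per_column)

# Count rows that exactly repeat an earlier row
duplicate_rows = df_check.duplicated().sum()
print(duplicate_rows)
```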

##### Empty cells
###### Remove rows
One way to deal with empty cells is to remove the rows that contain them.
- This is done with the dropna() method, which removes rows with empty cells.
- By default, the dropna() method returns a new DataFrame and does not change the original.
- To change the original DataFrame, use the inplace = True argument, i.e. dropna(inplace = True)

With inplace = True, dropna() does not return a new DataFrame; it removes all rows containing NULL values from the original DataFrame.
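
The difference between the two forms of dropna() can be sketched on a small made-up DataFrame (df_small is illustrative, not the contents of data.csv):

```python
import numpy as np
import pandas as pd

# Illustrative data with one empty Calories cell
df_small = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Calories": [409.1, np.nan, 340.0],
})

# Default: returns a new DataFrame, df_small itself is unchanged
cleaned = df_small.dropna()
rows_before = len(df_small)
print(rows_before, len(cleaned))

# inplace=True: modifies df_small directly and returns None
df_small.dropna(inplace=True)
print(len(df_small))
```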

###### Replace Empty Values
The fillna() method lets you replace empty cells with a value, i.e. fillna(value, inplace = True). This replaces all empty cells in the whole DataFrame with the same value.
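
A minimal sketch of filling every empty cell with one value. The DataFrame and the fill value 130 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with an empty cell in each column
df_fill = pd.DataFrame({
    "Pulse": [110.0, np.nan],
    "Calories": [np.nan, 479.0],
})

# Every NaN in the whole DataFrame is replaced with the same value
filled = df_fill.fillna(130)
print(filled)
```

Note that the same value lands in both columns, which is rarely what you want when the columns measure different things.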

###### Replace only for specified columns
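To replace empty values in only one column, select that column from the DataFrame, e.g. df['temperature'].fillna(30, inplace = True), where df is the name of the DataFrame and temperature is the name of the column. In recent pandas versions, calling fillna() with inplace = True on a selected column may not update the original DataFrame, so assigning the result back, df['temperature'] = df['temperature'].fillna(30), is the safer form.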
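
A sketch of filling a single column, using the assignment form rather than inplace = True. That choice is an assumption about what behaves consistently across pandas versions, and the data is made up.

```python
import numpy as np
import pandas as pd

# Illustrative data with an empty cell in each column
df_col = pd.DataFrame({
    "Pulse": [110.0, np.nan],
    "Calories": [np.nan, 479.0],
})

# Fill only the Calories column and assign the result back
df_col["Calories"] = df_col["Calories"].fillna(130)
print(df_col)
```

The empty cell in Pulse is left untouched.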

###### Replace Using Mean,Median or Mode
A common way to replace the empty cells is to use the mean, median, or mode value of the column, e.g.
- df = pd.read_csv('data.csv')
- x = df['Calories'].mean()
- df['Calories'] = df['Calories'].fillna(x)

Here x holds the mean of the Calories column, and fillna() replaces the empty values in that column with it. The column name must match the header in the file exactly, including its capital letter.
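
The three statistics can be compared on a small made-up Series (the values are illustrative, not from data.csv). Note that mode() returns a Series because several values can tie, so the sketch takes the first one.

```python
import numpy as np
import pandas as pd

# Illustrative column with one empty cell
calories = pd.Series([400.0, 300.0, np.nan, 300.0, 100.0], name="Calories")

# All three statistics skip NaN values by default
mean_value = calories.mean()
median_value = calories.median()
mode_value = calories.mode()[0]   # mode() may return several values; take the first

# Replace the empty cell with the chosen statistic
filled_with_mean = calories.fillna(mean_value)
print(mean_value, median_value, mode_value)
print(filled_with_mean)
```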



##### Cleaning Data of wrong format
There are two options:
- Remove the rows, or
- Convert all cells in the column into the same format
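
Both options can be combined. The sketch below assumes a numeric column that contains one badly formatted entry; the data is made up, and pd.to_numeric() is just one possible converter. It converts what it can, then removes the rows that could not be converted.

```python
import pandas as pd

# Illustrative column where one cell is not a valid number
df_fmt = pd.DataFrame({"Duration": ["60", "45", "sixty", "30"]})

# Convert to numbers; values that cannot be parsed become NaN
df_fmt["Duration"] = pd.to_numeric(df_fmt["Duration"], errors="coerce")

# Remove the rows whose value could not be converted
df_fmt = df_fmt.dropna(subset=["Duration"])
print(df_fmt)
```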