## `Pandas`
- used for cleaning, exploring, manipulating, and analyzing data.

#### &emsp;`why to use?`
- &emsp;&emsp; can analyze large amounts of data and derive statistical inferences.
- &emsp;&emsp; transforms messy data into a readable and relevant form.

#### &emsp; `Package installation`
```bash
# Run in a command prompt after activating the environment
pip install pandas
```

In [None]:
# import in file to use its features.
import pandas as pd

# check version
print(pd.__version__)

#### `Datatypes`
- Series
- DataFrame

### `Series`
- like a column in a table.
- can hold data of any type.
- is a one-dimensional (1-D) labeled array.

#### `Create series from list`

In [None]:
arr = [1, 2, 3]
series_1 = pd.Series(arr)

print(type(series_1))
print(series_1)

#### `Labels`
- If no labels are provided, the values are indexed by position (0, 1, 2, ...).
- The values can be given explicit labels through the `index` argument.
- Values can be accessed by position or by label (if labels are provided).

#### `Create series from list along with labels.`

In [None]:
arr = [1, 2, 3]
labels = ['first', 'second', 'third']
series_2 = pd.Series(arr, index=labels)

print(type(series_2))
print(series_2)

#### `Access values`

In [None]:
print(f"first value of series_1: {series_1[0]}")

print(f"first value of series_2: {series_2['first']}")
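
Plain square brackets can be ambiguous when the labels themselves are integers. As a minimal sketch (the names `by_label` and `by_position` are illustrative, not from the notebook), the `.loc` and `.iloc` accessors make the intent explicit:

```python
import pandas as pd

series_2 = pd.Series([1, 2, 3], index=['first', 'second', 'third'])

# .loc selects by label, .iloc selects by integer position
by_label = series_2.loc['second']
by_position = series_2.iloc[1]

print(by_label, by_position)
```

Both lookups refer to the same element here, because `'second'` sits at position 1.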

#### `Create series from dictionary`

In [None]:
places_area = {'Kathmandu':1234, 'Pokhara':2345, 'Dharan':3456}

series_3 = pd.Series(places_area)
print(series_3)

# keys in the dictionary become labels in the series

#### `Create series from subset of dictionary`

In [None]:
places_area = {'Kathmandu':1234, 'Pokhara':2345, 'Dharan':3456}

series_4 = pd.Series(places_area, index=['Kathmandu', 'Pokhara'])
print(series_4)
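
When the `index` list names a label that is not a key in the dictionary, pandas fills that entry with `NaN` instead of raising an error. A small sketch, where `'Butwal'` is a hypothetical label added only for illustration:

```python
import pandas as pd

places_area = {'Kathmandu': 1234, 'Pokhara': 2345, 'Dharan': 3456}

# 'Butwal' is a hypothetical label that is not a key in the dictionary
series_5 = pd.Series(places_area, index=['Kathmandu', 'Butwal'])
print(series_5)

# check which entries had no matching key
print(series_5.isna())
```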

### `DataFrame`
- like a table.
- 2-D data structure having rows and columns.

#### `Create a dataframe from dictionary`

In [None]:
subject_marks = {
    'english' : [50, 60, 70, 80 , 90],
    'math' : [51, 53, 55, 52, 50],
    'science' : [80, 81, 82, 83, 84],
    'computer' : [90, 91, 92, 93, 94]
}

df = pd.DataFrame(subject_marks)
print(df)

#### `Locate rows in DataFrame`
- uses the `loc` attribute to select rows by index label.

In [None]:
# a single label returns the row as a Series
row = df.loc[0]
print(type(row))
print(row)

In [None]:
# a list of labels returns the rows as a DataFrame
row = df.loc[[0]]
print(type(row))
print(row)
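
`loc` works with index labels, while the related `iloc` attribute works with integer positions. The sketch below uses a small illustrative frame, `df_demo` (not part of the notebook's data), to show one difference that is easy to miss: a `loc` slice includes its end label, whereas an `iloc` slice excludes its end position.

```python
import pandas as pd

# illustrative data, with the default 0..n-1 index
df_demo = pd.DataFrame({'english': [50, 60, 70], 'math': [51, 53, 55]})

rows_loc = df_demo.loc[0:1]    # label slice: includes both 0 and 1
rows_iloc = df_demo.iloc[0:1]  # position slice: excludes position 1

print(len(rows_loc), len(rows_iloc))
```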

#### `Named Indexes`

In [None]:
# Create a dataframe from dictionary
subject_marks = {
    'english' : [50, 60, 70, 80 , 90],
    'math' : [51, 53, 55, 52, 50],
    'science' : [80, 81, 82, 83, 84],
    'computer' : [90, 91, 92, 93, 94]
}

labels = ['student1','student2','student3','student4','student5']

df = pd.DataFrame(subject_marks, index=labels)
print(df)

#### `Access rows using labels`

In [None]:
row = df.loc['student1']
print(type(row))
print(row)

In [None]:
row = df.loc[['student1']]
print(type(row))
print(row)
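
`loc` also accepts a column selector after the row selector, so a single cell or a sub-table can be picked out in one step. A minimal sketch on an illustrative frame (`df_demo`, `mark`, and `subset` are assumed names):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'english': [50, 60], 'math': [51, 53]},
    index=['student1', 'student2'],
)

# row label and column name -> a single value
mark = df_demo.loc['student2', 'math']

# lists of labels -> a smaller DataFrame
subset = df_demo.loc[['student1', 'student2'], ['english']]

print(mark)
print(subset)
```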

#### `Access columns using column names`

In [None]:
# List out the names of columns in dataframe
print('Before adding column(s)\n', df.columns)

In [None]:
# Adding column for subject 'social'
df['social'] = [1, 2, 3, 4, 5]
print('After adding column(s)\n', df.columns)

In [None]:
df.head()

#### `Find n-largest and n-smallest values in columns`

In [None]:
# find the n rows with the largest values in a particular column
print(df.nlargest(3, ['english']))

In [None]:
# find the n rows with the smallest values in a particular column
print(df.nsmallest(3, ['english']))
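
For a column without tied values, `nlargest` should return the same rows as sorting in descending order and taking the first n. The sketch below compares the two on an illustrative frame (`df_demo` is an assumed name):

```python
import pandas as pd

df_demo = pd.DataFrame({'english': [50, 60, 70, 80, 90],
                        'math': [51, 53, 55, 52, 50]})

top_nlargest = df_demo.nlargest(2, 'english')
top_sorted = df_demo.sort_values('english', ascending=False).head(2)

print(top_nlargest)
print(top_sorted)
```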

#### `Convert dataframe into numpy array`

In [None]:
import pandas as pd
 
# initialize a dataframe
df = pd.DataFrame(
        [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [10, 11, 12]],
        columns=['a', 'b', 'c']
    )
df.head()

In [None]:
# convert dataframe to numpy array
arr = df.to_numpy()
 
print('Numpy Array \n', arr)
print('Type of array: ', type(arr))
print('Type of elements: ', arr.dtype)

In [None]:
# a dataframe column is already a Series, so it can be converted directly
series = df['a'].head()
arr = series.to_numpy()

print('Numpy Array \n', arr)
print('Type of array: ', type(arr))
print('Type of elements: ', arr.dtype)
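
The element type of the resulting array depends on the columns being converted. As a rough sketch with an illustrative frame (`mixed` is an assumed name), numeric columns of different types are upcast to a common numeric type, while mixing numbers and strings falls back to `object`:

```python
import pandas as pd

mixed = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

numeric_arr = mixed[['a', 'b']].to_numpy()  # int and float columns
mixed_arr = mixed.to_numpy()                # numbers and strings together

print(numeric_arr.dtype)
print(mixed_arr.dtype)
```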

#### `Load Data from CSV file`
- uses `pd.read_csv()` to load a CSV file into a DataFrame.

In [None]:
# Load csv data as DataFrame
df = pd.read_csv('./csv_files/organizations-10000.csv')
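
If the CSV file is not at hand, `pd.read_csv()` can also read from a file-like object. The sketch below feeds it a small in-memory string; the rows are made-up sample data and do not come from `organizations-10000.csv`:

```python
import io

import pandas as pd

# made-up sample rows, for illustration only
csv_text = "Index,Name,Country\n1,Acme,Nepal\n2,Globex,India\n"

df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)
```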

In [None]:
# Remove a column by name
df = df.drop(['Index'], axis=1)

In [None]:
# statistical information of dataframe
df.describe()

In [None]:
# Datatype, count, and non-null information
df.info()

In [None]:
# first 5 rows
df.head()

In [None]:
# last 5 rows
df.tail()

In [None]:
# Remove rows with empty cells, returning a new dataframe
new_df = df.dropna()
len(new_df)

In [None]:
# Remove rows with empty cells, modifying the original dataframe in place
df.dropna(inplace=True)

In [None]:
# Remove rows where a particular column is empty
df.dropna(subset=['Website'], inplace=True)

In [None]:
# replace all the NaN values in dataframe
df.fillna(0, inplace=True)

In [None]:
# replace the NaN values of a particular column
# (assign the result back; inplace=True on a column selection may not update df)
df['Country'] = df['Country'].fillna(0)

In [None]:
# Replace NaN values with a statistic of the column (mean, median, or mode)
num_emp_mean = df['Number of employees'].mean()
df['Number of employees'] = df['Number of employees'].fillna(num_emp_mean)
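
The same pattern applies to the median and the mode. A minimal sketch on made-up data (the frame `df_demo` and its column names are assumptions for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'num_employees': [10, None, 30, 1000],
    'country': ['Nepal', None, 'Nepal', 'India'],
})

# the median is less sensitive to outliers such as 1000 than the mean
emp_median = df_demo['num_employees'].median()
df_demo['num_employees'] = df_demo['num_employees'].fillna(emp_median)

# mode() returns a Series because ties are possible; take the first entry
country_mode = df_demo['country'].mode()[0]
df_demo['country'] = df_demo['country'].fillna(country_mode)

print(df_demo)
```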

In [None]:
df

#### `Column Selection of Dataframe`

In [None]:
# Listing all the columns in dataframe
df.columns

In [None]:
# Selecting single column as Series
name = df['Name']
print(name, type(name))

In [None]:
name = df.Name
print(name, type(name))

In [None]:
# selecting single column as DataFrame
name = df[['Name']]
print(name, type(name))

#### `Column Removal`

In [None]:
df.columns

In [None]:
# Remove several columns by name
df.drop(['Website', 'Description'], axis=1, inplace=True)

In [None]:
df.columns

In [None]:
# Rename the column names with rename() method
df.rename(
    columns = {
        'Organization Id':'org_id',
        'Number of employees':'num_employees'
    },
    inplace = True
)

In [None]:
df.columns

#### `Adding new rows in dataframe`

In [None]:
# Add new row
new_data = {'org_id':'aaaaaa', 'Name':'SomeName', 'Country':'Nepal', 'Founded':2015, 'Industry':'software',
       'num_employees':50}

new_row = pd.DataFrame(new_data, index=[0])
df = pd.concat([new_row, df]).reset_index(drop=True)
df.head()
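
The cell above puts the new row at the top. To append at the bottom instead, the order inside `pd.concat` can be swapped, and `ignore_index=True` renumbers the rows in the same call. A small sketch on made-up data (`df_demo` is an assumed name):

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Acme'], 'Country': ['India']})

# a list with one dictionary creates a one-row dataframe
new_row = pd.DataFrame([{'Name': 'SomeName', 'Country': 'Nepal'}])

# existing rows first, new row last; ignore_index renumbers from 0
df_demo = pd.concat([df_demo, new_row], ignore_index=True)
print(df_demo)
```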

#### `Removal of duplicate rows`

In [None]:
## Run the cell above more than once to create duplicate rows
# remove duplicated rows
dup = df.drop_duplicates().reset_index(drop=True)
dup

In [None]:
## Keep only the first row for each distinct value in a particular column
df.drop_duplicates(subset=['Country'])

In [None]:
# Remove duplicates based on specific column(s) and
# keep the last occurrence rather than the first
df.drop_duplicates(subset=['Founded', 'Industry'], keep='last')
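
The effect of the `keep` argument is easier to see on a tiny frame. A minimal sketch with made-up rows (`df_demo` and the result names are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Country': ['Nepal', 'Nepal', 'India'],
    'Founded': [2015, 2020, 2015],
})

keep_first = df_demo.drop_duplicates(subset=['Country'])               # default keep='first'
keep_last = df_demo.drop_duplicates(subset=['Country'], keep='last')
keep_none = df_demo.drop_duplicates(subset=['Country'], keep=False)    # drop every duplicated row

print(list(keep_first.index), list(keep_last.index), list(keep_none.index))
```

The original index labels are preserved, which is why the earlier cell chains `reset_index(drop=True)`.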

#### `Filter dataframe based on condition`

In [None]:
df.head()

In [None]:
condition = df['Country']=='Nepal'
condition

In [None]:
df[condition].head()
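
Several conditions can be combined with `&` (and) and `|` (or); each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on made-up data (`df_demo` and its values are assumptions):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Country': ['Nepal', 'Nepal', 'India'],
    'Founded': [2005, 2015, 2018],
})

# rows that are in Nepal AND founded after 2010
both = df_demo[(df_demo['Country'] == 'Nepal') & (df_demo['Founded'] > 2010)]

# rows that are in India OR founded before 2010
either = df_demo[(df_demo['Country'] == 'India') | (df_demo['Founded'] < 2010)]

print(both)
print(either)
```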

#### `Retrieve unique values in columns`

In [None]:
# unique countries in column 'Country'
df['Country'].unique()

In [None]:
# number of unique values in column
df['Country'].nunique(dropna=True)
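
To see how often each unique value occurs, `value_counts()` is a natural companion to `unique()` and `nunique()`. A short sketch on a made-up Series (`countries` is an assumed name):

```python
import pandas as pd

countries = pd.Series(['Nepal', 'India', 'Nepal', None])

counts = countries.value_counts()  # NaN/None entries are excluded by default
print(counts)
print(countries.nunique())         # also ignores missing values by default
```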

#### `Retrieve index number of column`

In [None]:
index_num = df.columns.get_loc('Country')
print(index_num)
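
One possible use of the position returned by `get_loc` is position-based selection with `iloc`. A minimal sketch on made-up data (`df_demo` and `col_pos` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['A', 'B'], 'Country': ['Nepal', 'India']})

col_pos = df_demo.columns.get_loc('Country')

# select every row of the column at that position
by_position = df_demo.iloc[:, col_pos]
print(col_pos)
print(by_position)
```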

#### `Convert series into dataframe`

In [None]:
# Creating series
quantities =  [60, 20, 40, 90]
labels = ['apple', 'realme', 'oppo', 'xiaomi']
s = pd.Series(quantities, index=labels)
s

In [None]:
# Conversion into dataframe
col_name = "quantity"  # the values are quantities; the brand names are the index
s_df = s.to_frame(name=col_name)

print(type(s), f"\n{' '*18}to\n", type(s_df))
s_df

In [None]:
# reset the index
s_df = s_df.reset_index()
s_df

In [None]:
# drop the index column
s_df = s_df.drop(['index'], axis=1)
s_df
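
If the old labels are not needed at all, the reset and the drop can be done in one step with `reset_index(drop=True)`. A sketch under that assumption, using a shortened version of the series and a column name, `'quantity'`, chosen only for illustration:

```python
import pandas as pd

s = pd.Series([60, 20], index=['apple', 'oppo'])

# drop=True discards the old labels instead of adding them as a column
s_df = s.to_frame(name='quantity').reset_index(drop=True)
print(s_df)
```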