In [6]:
import numpy as np
print(dir(np))
print(np.__doc__)


NumPy
=====

Provides
  1. An array object of arbitrary homogeneous items
  2. Fast mathematical operations over arrays
  3. Linear Algebra, Fourier Transforms, Random Number Generation

How to use the documentation
----------------------------
Documentation is available in two forms: docstrings provided
with the code, and a loose standing reference guide, available from
`the NumPy homepage <https://numpy.org>`_.

We recommend exploring the docstrings using
`IPython <https://ipython.org>`_, an advanced Python shell with
TAB-completion and introspection capabilities.  See below for further
instructions.

The docstring examples assume that `numpy` has been imported as ``np``::

  >>> import numpy as np

Code snippets are indicated by three greater-than signs::

  >>> x = 42
  >>> x = x + 1

Use the built-in ``help`` function to view a function's docstring::

  >>> help(np.sort)
  ... # doctest: +SKIP

For some objects, ``np.info(obj)`` may provide additional help.  This is
particularly 

In [2]:
import pandas as pd
print(dir(pd))

['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_built_with_meson', '_config', '_is_numpy_dev', '_libs', '_pandas_datetime_CAPI', '_pandas_parser_CAPI', '_testing', '_typing', '_version_meson', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'conca

In [4]:
print(pd.__doc__)


pandas - a powerful data analysis and manipulation library for Python

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and

In [7]:
# HOW TO GENERATE DUMMY DATA
# GENERATING A TABLE OF CA SCORES FOR 50 STUDENTS
# creating synthetic (randomly generated) student score data

col_name = ['CA1','CA2','CA3']
sit_no = np.arange(1,51)
np.random.seed(2)
score = np.random.randint(1,11,150)
score = score.reshape(50,3)
student_score = pd.DataFrame(data = score, index = sit_no, columns = col_name) # build the table with pandas; note: index supplies the row labels
student_score



Unnamed: 0,CA1,CA2,CA3
1,9,9,7
2,3,9,8
3,3,2,6
4,5,5,6
5,8,4,7
6,5,4,8
7,7,2,4
8,6,9,5
9,7,4,10
10,3,1,5
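
The cell above seeds NumPy's legacy global random state with `np.random.seed`. As a hedged alternative, the sketch below builds the same kind of table with NumPy's `Generator` API. The names `rng` and `student_score_rng` and the seed value are illustrative assumptions, and the numbers drawn will differ from the table shown above.

```python
import numpy as np
import pandas as pd

# Assumption: a local generator seeded with 2; it does not reproduce np.random.seed(2).
rng = np.random.default_rng(2)

col_name = ['CA1', 'CA2', 'CA3']
sit_no = np.arange(1, 51)                    # seat numbers 1..50 used as row labels

# integers() excludes the upper bound, so every score falls in 1..10.
score = rng.integers(1, 11, size=(50, 3))
student_score_rng = pd.DataFrame(score, index=sit_no, columns=col_name)

print(student_score_rng.shape)
```

Drawing the array directly with `size=(50, 3)` avoids the separate `reshape` step.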


In [8]:
# selecting a single column from the dataframe
student_score['CA3']

1      7
2      8
3      6
4      6
5      7
6      8
7      4
8      5
9     10
10     5
11     2
12     3
13     8
14     9
15    10
16     1
17     9
18    10
19     7
20     4
21     2
22     2
23     6
24     5
25     5
26     4
27     6
28     1
29     6
30     5
31     2
32     9
33     3
34     2
35     2
36     2
37     1
38     6
39     2
40     2
41     9
42     8
43     2
44     3
45     5
46     2
47    10
48     3
49     8
50     9
Name: CA3, dtype: int32

In [12]:
# selecting multiple columns
student_score[['CA1','CA2']]

Unnamed: 0,CA1,CA2
1,9,9
2,3,9
3,3,2
4,5,5
5,8,4
6,5,4
7,7,2
8,6,9
9,7,4
10,3,1


In [15]:
student_score[['CA1','CA3']]

Unnamed: 0,CA1,CA3
1,9,7
2,3,8
3,3,6
4,5,6
5,8,7
6,5,8
7,7,4
8,6,5
9,7,10
10,3,5


In [16]:
# selecting a row by its index label (.loc):
student_score.loc[20]

CA1    7
CA2    7
CA3    4
Name: 20, dtype: int32

In [18]:
# selecting multiple rows by their index labels (.loc):
student_score.loc[[4,5,7,12,20]]

Unnamed: 0,CA1,CA2,CA3
4,5,5,6
5,8,4,7
7,7,2,4
12,8,9,3
20,7,7,4


In [19]:
# selecting a row by its integer position (.iloc)
student_score.iloc[15-1] # position 14 holds the row labelled 15

CA1     6
CA2    10
CA3    10
Name: 15, dtype: int32

In [20]:
student_score.iloc[[10,34,23,20]] # positions 10, 34, 23, 20 hold the rows labelled 11, 35, 24, 21

Unnamed: 0,CA1,CA2,CA3
11,3,5,2
35,10,3,2
24,2,3,5
21,9,3,2
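
Because the row labels here start at 1 while positions start at 0, `.loc` and `.iloc` are offset by one. The minimal sketch below illustrates this on a hypothetical three-row frame (`demo` is not part of the notebook's data).

```python
import pandas as pd

# Hypothetical frame whose labels start at 1, like the seat numbers above.
demo = pd.DataFrame({'CA1': [9, 3, 3]}, index=[1, 2, 3])

by_label = demo.loc[3]       # the row whose index label is 3
by_position = demo.iloc[2]   # the third row (positions start at 0)

print(by_label.equals(by_position))
```

With a default `RangeIndex` starting at 0, the label and the position coincide, which is why the offset only appears once a custom index is set.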


In [21]:
# adding a new column after the existing columns
np.random.seed(2)
student_score['Exam Score'] = np.random.randint(25,71,50).reshape(50,1)
student_score

Unnamed: 0,CA1,CA2,CA3,Exam Score
1,9,9,7,65
2,3,9,8,40
3,3,2,6,70
4,5,5,6,33
5,8,4,7,47
6,5,4,8,68
7,7,2,4,43
8,6,9,5,36
9,7,4,10,65
10,3,1,5,32


In [23]:
# inserting a new column at a specified position (loc=0 makes it the first column)
np.random.seed(2)
student_score.insert(loc=0, value= np.random.randint(13,17,50).reshape(50,1), column = 'AGE')
student_score

Unnamed: 0,AGE,CA1,CA2,CA3,Exam Score
1,13,9,9,7,65
2,16,3,9,8,40
3,14,3,2,6,70
4,13,5,5,6,33
5,15,8,4,7,47
6,16,5,4,8,68
7,15,7,2,4,43
8,16,6,9,5,36
9,13,7,4,10,65
10,16,3,1,5,32


In [25]:
# creating a new column from existing columns
student_score['Total Score'] = student_score['CA1'] + student_score['CA2'] + student_score['CA3'] + student_score['Exam Score']
student_score

Unnamed: 0,AGE,CA1,CA2,CA3,Exam Score,Total Score
1,13,9,9,7,65,90
2,16,3,9,8,40,60
3,14,3,2,6,70,81
4,13,5,5,6,33,49
5,15,8,4,7,47,66
6,16,5,4,8,68,85
7,15,7,2,4,43,56
8,16,6,9,5,36,56
9,13,7,4,10,65,86
10,16,3,1,5,32,41
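
Adding the columns one by one works, but a row-wise sum over a list of columns is shorter when many columns are involved. This is a sketch on a hypothetical two-row sample copied from the first rows of the table above; `sample` and `score_cols` are illustrative names.

```python
import pandas as pd

# Hypothetical sample: the first two rows of the score columns shown above.
sample = pd.DataFrame(
    {'CA1': [9, 3], 'CA2': [9, 9], 'CA3': [7, 8], 'Exam Score': [65, 40]},
    index=[1, 2],
)

score_cols = ['CA1', 'CA2', 'CA3', 'Exam Score']
sample['Total Score'] = sample[score_cols].sum(axis=1)   # axis=1 sums across each row

print(sample['Total Score'].tolist())   # [90, 60]
```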


In [27]:
def grade(totalscore):
    # Half-open ranges, so a fractional total such as 69.995 cannot fall between two bands.
    if 70 <= totalscore <= 100:
        return 'A'
    elif 60 <= totalscore < 70:
        return 'B'
    elif 50 <= totalscore < 60:
        return 'C'
    elif 40 <= totalscore < 50:
        return 'D'
    elif 30 <= totalscore < 40:
        return 'E'
    else:
        return 'F'  # anything below 30 (or above 100)
student_score['Grade'] = student_score['Total Score'].apply(grade)
student_score

Unnamed: 0,AGE,CA1,CA2,CA3,Exam Score,Total Score,Grade
1,13,9,9,7,65,90,A
2,16,3,9,8,40,60,B
3,14,3,2,6,70,81,A
4,13,5,5,6,33,49,D
5,15,8,4,7,47,66,B
6,16,5,4,8,68,85,A
7,15,7,2,4,43,56,C
8,16,6,9,5,36,56,C
9,13,7,4,10,65,86,A
10,16,3,1,5,32,41,D
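
The same banding can also be expressed without a Python-level function by using `pd.cut`. The sketch below is one possible vectorised version, not the notebook's method; it assumes totals never exceed 100 (anything from 70 upward is labelled 'A'), and `totals`, `bins`, and `labels` are hypothetical names.

```python
import numpy as np
import pandas as pd

totals = pd.Series([90, 60, 49, 39, 29])   # hypothetical total scores

# Left-closed bins: [-inf, 30) -> F, [30, 40) -> E, ..., [70, inf) -> A.
bins = [-np.inf, 30, 40, 50, 60, 70, np.inf]
labels = ['F', 'E', 'D', 'C', 'B', 'A']
grades = pd.cut(totals, bins=bins, labels=labels, right=False)

print(grades.tolist())   # ['A', 'B', 'D', 'E', 'F']
```

`right=False` makes each bin include its lower edge, matching the `>=` comparisons in `grade`.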


In [None]:
# dropping a row or column

# how to drop one column or several columns from a dataframe
# (axis = 1 targets columns; axis = 0 targets rows by index label)
# student_score.drop('column name', axis = 1, inplace = True)
# student_score.drop(['A','B'], axis = 1, inplace = True)
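
The commented templates above can be made concrete on a small frame. This is a minimal sketch on hypothetical data (`demo`, with columns `A`, `B`, `C`); it uses the `columns=` and `index=` keywords, which are equivalent to `axis=1` and `axis=0`, and returns new frames instead of modifying in place.

```python
import pandas as pd

# Hypothetical frame used only to illustrate drop().
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=[1, 2, 3])

no_b = demo.drop(columns=['B'])      # same as demo.drop('B', axis=1)
no_rows = demo.drop(index=[1, 3])    # same as demo.drop([1, 3], axis=0)

print(list(no_b.columns), list(no_rows.index))
```

Without `inplace=True` the original `demo` is left untouched, which makes it easier to re-run a cell safely.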

In [28]:
student_score.head()

Unnamed: 0,AGE,CA1,CA2,CA3,Exam Score,Total Score,Grade
1,13,9,9,7,65,90,A
2,16,3,9,8,40,60,B
3,14,3,2,6,70,81,A
4,13,5,5,6,33,49,D
5,15,8,4,7,47,66,B


In [30]:
student_score.tail()

Unnamed: 0,AGE,CA1,CA2,CA3,Exam Score,Total Score,Grade
46,14,7,4,2,40,53,C
47,16,9,6,10,66,91,A
48,15,6,5,3,70,84,A
49,13,8,9,8,33,58,C
50,13,4,5,9,42,60,B


In [31]:
student_score.shape

(50, 7)

In [33]:
student_score.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, 1 to 50
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   AGE          50 non-null     int32 
 1   CA1          50 non-null     int32 
 2   CA2          50 non-null     int32 
 3   CA3          50 non-null     int32 
 4   Exam Score   50 non-null     int32 
 5   Total Score  50 non-null     int32 
 6   Grade        50 non-null     object
dtypes: int32(6), object(1)
memory usage: 3.3+ KB


In [34]:
student_score.describe()

Unnamed: 0,AGE,CA1,CA2,CA3,Exam Score,Total Score
count,50.0,50.0,50.0,50.0,50.0,50.0
mean,14.76,5.78,5.82,5.22,48.94,65.76
std,1.204752,2.873453,2.854928,2.880547,14.53414,15.218893
min,13.0,1.0,1.0,1.0,27.0,39.0
25%,14.0,3.0,4.0,2.0,36.0,53.75
50%,15.0,6.0,6.0,5.0,48.0,65.0
75%,16.0,8.0,8.75,8.0,63.75,77.75
max,16.0,10.0,10.0,10.0,70.0,91.0


In [40]:
student_score.head().transpose()

Unnamed: 0,1,2,3,4,5
AGE,13,16,14,13,15
CA1,9,3,3,5,8
CA2,9,9,2,5,4
CA3,7,8,6,6,7
Exam Score,65,40,70,33,47
Total Score,90,60,81,49,66
Grade,A,B,A,D,B


In [41]:
student_score.transpose()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,41,42,43,44,45,46,47,48,49,50
AGE,13,16,14,13,15,16,15,16,13,16,...,13,15,15,16,16,14,16,15,13,13
CA1,9,3,3,5,8,5,7,6,7,3,...,9,2,1,6,1,7,9,6,8,4
CA2,9,9,2,5,4,4,2,9,4,1,...,7,6,10,10,1,4,6,5,9,5
CA3,7,8,6,6,7,8,4,5,10,5,...,9,8,2,3,5,2,10,3,8,9
Exam Score,65,40,70,33,47,68,43,36,65,32,...,51,40,64,33,70,40,66,70,33,42
Total Score,90,60,81,49,66,85,56,56,86,41,...,76,56,77,52,77,53,91,84,58,60
Grade,A,B,A,D,B,A,C,C,A,D,...,A,C,A,C,A,C,A,A,C,B


In [42]:
# getting all the column names
student_score.columns

Index(['AGE', 'CA1', 'CA2', 'CA3', 'Exam Score', 'Total Score', 'Grade'], dtype='object')

In [44]:
# RENAMING COLUMNS
student_score.rename({'CA1':'First_CA', 'CA2':'Second_CA', 'CA3':'Third_CA'}, axis = 1, inplace=True)
student_score.columns

Index(['AGE', 'First_CA', 'Second_CA', 'Third_CA', 'Exam Score', 'Total Score',
       'Grade'],
      dtype='object')

In [45]:
student_score.head()

Unnamed: 0,AGE,First_CA,Second_CA,Third_CA,Exam Score,Total Score,Grade
1,13,9,9,7,65,90,A
2,16,3,9,8,40,60,B
3,14,3,2,6,70,81,A
4,13,5,5,6,33,49,D
5,15,8,4,7,47,66,B


In [46]:
# replacing a value: first look up the current value (row label 1, column 'AGE')
student_score.loc[1, 'AGE']

13

In [47]:
# Chained indexing (student_score['AGE'].loc[1] = 16) raises a SettingWithCopyWarning;
# setting the value through a single .loc call avoids it.
student_score.loc[1, 'AGE'] = 16
student_score.head()


Unnamed: 0,AGE,First_CA,Second_CA,Third_CA,Exam Score,Total Score,Grade
1,16,9,9,7,65,90,A
2,16,3,9,8,40,60,B
3,14,3,2,6,70,81,A
4,13,5,5,6,33,49,D
5,15,8,4,7,47,66,B
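
The same single-step `.loc` pattern extends to replacing every value that meets a condition. The sketch below uses a hypothetical `demo` frame and an arbitrary rule (set every age of 13 to 14) purely for illustration.

```python
import pandas as pd

# Hypothetical ages; not the notebook's data.
demo = pd.DataFrame({'AGE': [13, 16, 14, 13]}, index=[1, 2, 3, 4])

# Boolean mask selects the rows, 'AGE' selects the column, all in one .loc call.
demo.loc[demo['AGE'] == 13, 'AGE'] = 14

print(demo['AGE'].tolist())   # [14, 16, 14, 14]
```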


In [49]:
student_score['Grade'].unique()

array(['A', 'B', 'D', 'C', 'E'], dtype=object)

In [50]:
student_score['Grade'].nunique()

5

In [51]:
student_score['Grade'].value_counts()

Grade
A    21
C    10
B     9
D     9
E     1
Name: count, dtype: int64
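
When proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A small sketch on a hypothetical series of grades:

```python
import pandas as pd

# Hypothetical grades; not the notebook's data.
grades = pd.Series(['A', 'A', 'B', 'C'], name='Grade')

share = grades.value_counts(normalize=True)   # fractions that sum to 1
print(share['A'])   # 0.5
```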

In [53]:
df = pd.read_csv('sample_pivot.csv') # df is short for DataFrame
df.head()

Unnamed: 0,Date,Region,Type,Units,Sales
0,7/11/2020,East,Children's Clothing,18.0,306
1,9/23/2020,North,Children's Clothing,14.0,448
2,4/2/2020,South,Women's Clothing,17.0,425
3,2/28/2020,East,Children's Clothing,26.0,832
4,3/19/2020,West,Women's Clothing,3.0,33


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    1000 non-null   object 
 1   Region  1000 non-null   object 
 2   Type    1000 non-null   object 
 3   Units   911 non-null    float64
 4   Sales   1000 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 39.2+ KB


In [56]:
# checking for missing values


df.isnull().sum()

Date       0
Region     0
Type       0
Units     89
Sales      0
dtype: int64

In [57]:
# handling missing values
# DIFFERENT FILLING TECHNIQUES
# 1. FORWARD FILLING TECHNIQUE
df.tail()

Unnamed: 0,Date,Region,Type,Units,Sales
995,2/11/2020,East,Children's Clothing,35.0,735
996,12/25/2020,North,Men's Clothing,,1155
997,8/31/2020,South,Men's Clothing,13.0,208
998,8/23/2020,South,Women's Clothing,17.0,493
999,8/17/2020,North,Women's Clothing,25.0,300


In [58]:
df1 = df.copy()

In [59]:
df1

Unnamed: 0,Date,Region,Type,Units,Sales
0,7/11/2020,East,Children's Clothing,18.0,306
1,9/23/2020,North,Children's Clothing,14.0,448
2,4/2/2020,South,Women's Clothing,17.0,425
3,2/28/2020,East,Children's Clothing,26.0,832
4,3/19/2020,West,Women's Clothing,3.0,33
...,...,...,...,...,...
995,2/11/2020,East,Children's Clothing,35.0,735
996,12/25/2020,North,Men's Clothing,,1155
997,8/31/2020,South,Men's Clothing,13.0,208
998,8/23/2020,South,Women's Clothing,17.0,493


In [60]:
df1.ffill(inplace = True)  # fillna(method = 'ffill') is deprecated in recent pandas versions
df1.tail()


Unnamed: 0,Date,Region,Type,Units,Sales
995,2/11/2020,East,Children's Clothing,35.0,735
996,12/25/2020,North,Men's Clothing,35.0,1155
997,8/31/2020,South,Men's Clothing,13.0,208
998,8/23/2020,South,Women's Clothing,17.0,493
999,8/17/2020,North,Women's Clothing,25.0,300


In [62]:
# 2. BACKWARD FILLING TECHNIQUE
df1 = df.copy()
df1.bfill(inplace = True)  # fillna(method = 'bfill') is deprecated in recent pandas versions
df1.tail()


Unnamed: 0,Date,Region,Type,Units,Sales
995,2/11/2020,East,Children's Clothing,35.0,735
996,12/25/2020,North,Men's Clothing,13.0,1155
997,8/31/2020,South,Men's Clothing,13.0,208
998,8/23/2020,South,Women's Clothing,17.0,493
999,8/17/2020,North,Women's Clothing,25.0,300
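
The difference between the two techniques is easiest to see on a three-value series. The sketch below mirrors the `Units` values of rows 995 to 997 above; `units` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Units for rows 995-997 above: 35.0, missing, 13.0.
units = pd.Series([35.0, np.nan, 13.0])

print(units.ffill().tolist())   # [35.0, 35.0, 13.0]  gap takes the previous value
print(units.bfill().tolist())   # [35.0, 13.0, 13.0]  gap takes the next value
```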


In [None]:
# NOTE: FORWARD FILL AND BACKWARD FILL ARE NOT ADVISABLE HERE BECAUSE OF THE VARIABILITY OF THE DATA:
# A NEIGHBOURING ROW IS AN UNRELATED SALE, SO ITS VALUE IS NOT A RELIABLE SUBSTITUTE.
# FILLING VALUES USING STATISTICAL METHODS

In [None]:
# USING STATISTICAL METHOD:

In [63]:
df2 = df.copy()
df2['Units'] = df2['Units'].fillna(df2['Units'].mean())  # assign back; inplace fillna on a selected column may not update df2
df2.tail()

Unnamed: 0,Date,Region,Type,Units,Sales
995,2/11/2020,East,Children's Clothing,35.0,735
996,12/25/2020,North,Men's Clothing,19.638858,1155
997,8/31/2020,South,Men's Clothing,13.0,208
998,8/23/2020,South,Women's Clothing,17.0,493
999,8/17/2020,North,Women's Clothing,25.0,300
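
A single overall mean ignores differences between clothing types. One possible refinement, shown here as a sketch on hypothetical data and not as the notebook's method, fills each gap with the mean of its own `Type` group via `groupby(...).transform('mean')`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names follow the sales table above.
demo = pd.DataFrame({
    'Type': ["Men's Clothing", "Men's Clothing", "Women's Clothing", "Women's Clothing"],
    'Units': [10.0, np.nan, 20.0, np.nan],
})

# transform('mean') returns one value per row: the mean of that row's group (NaN skipped).
group_mean = demo.groupby('Type')['Units'].transform('mean')
demo['Units'] = demo['Units'].fillna(group_mean)

print(demo['Units'].tolist())   # [10.0, 10.0, 20.0, 20.0]
```

The median can be swapped in with `transform('median')` when the column has outliers.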


In [65]:
# DROPPING MISSING VALUES
df.dropna(inplace = True)
df.isnull().sum()

Date      0
Region    0
Type      0
Units     0
Sales     0
dtype: int64
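
`dropna(inplace = True)` permanently removes rows from `df`. A hedged alternative is to keep the original and work on a cleaned copy, limiting the check to the column that actually has gaps with `subset`. The frame `demo` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing Units value.
demo = pd.DataFrame({'Units': [18.0, np.nan, 17.0], 'Sales': [306, 448, 425]})

# Only rows missing 'Units' are dropped; demo itself is unchanged.
cleaned = demo.dropna(subset=['Units'])

print(len(demo), len(cleaned))   # 3 2
```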