### Using Pandas to Get Familiar With Your Data


***The first step in any machine learning project is to familiarize yourself with the data. You'll use the pandas library for this. pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd, which is done with the command `import pandas as pd`.***

What kind of data does pandas handle?

In [1]:
#import modules
import pandas as pd
import numpy as np

In [2]:
# Check the installed pandas version
print(pd.__version__)

1.2.4


In [3]:
# List every name exposed by the pandas namespace
dir(pd)

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Flags',
 'Float32Dtype',
 'Float64Dtype',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseDtype',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_dev',
 '_lib',
 '_libs',
 '_np_version_under1p17',
 '_np_version_under1p18',
 '_testing'

#### Pandas provides two main data structures:
1. Series
2. DataFrame

## Pandas Series

* Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
* The axis labels are collectively referred to as the index.
* The basic method to create a Series is to call `pd.Series(data, index=index)`.

***Here is an illustrative example: the total points obtained by the winners of six top European leagues in the 2021/2022 season.***


In [4]:
#initializing a variable t6_pts to store the points obtained
t6_pts = pd.Series([93,86,77,89,86,83])
t6_pts

0    93
1    86
2    77
3    89
4    86
5    83
dtype: int64

In [5]:
# Give the Series a name
t6_pts.name = 'Top six league winners'
t6_pts

0    93
1    86
2    77
3    89
4    86
5    83
Name: Top six league winners, dtype: int64

In [6]:
# Display the data type (the values follow in the next cell)
t6_pts.dtype


dtype('int64')

In [7]:
t6_pts.values

array([93, 86, 77, 89, 86, 83])

In [8]:
# Display the first element (label 0 of the default RangeIndex)
t6_pts[0]


93

In [9]:
t6_pts.index

RangeIndex(start=0, stop=6, step=1)

In [10]:
# Replace the default index with the team names

t6_pts.index = [
    'Manchester City',
    'Real Madrid',
    'Bayern Muchen',
    'AC Milan',
    'PSG',
    'Ajax',
]

t6_pts

Manchester City    93
Real Madrid        86
Bayern Muchen      77
AC Milan           89
PSG                86
Ajax               83
Name: Top six league winners, dtype: int64

In [11]:
# Alternative: pass the labels at construction time with pd.Series(data, index=index)

pd.Series(
    [93, 86, 77, 89, 86, 83],
    index=['Manchester City','Real Madrid', 'Bayern Muchen', 'AC Milan', 'PSG', 'Ajax'],
    name='Top six league winners')

Manchester City    93
Real Madrid        86
Bayern Muchen      77
AC Milan           89
PSG                86
Ajax               83
Name: Top six league winners, dtype: int64

In [12]:
# Display part of the data by passing only the labels to keep
pd.Series(t6_pts, index=['Real Madrid', 'Bayern Muchen', 'AC Milan'])


Real Madrid      86
Bayern Muchen    77
AC Milan         89
Name: Top six league winners, dtype: int64

### Indexing 

In [13]:
# Display the full Series again
t6_pts

Manchester City    93
Real Madrid        86
Bayern Muchen      77
AC Milan           89
PSG                86
Ajax               83
Name: Top six league winners, dtype: int64

In [14]:
t6_pts[['Manchester City','Ajax']]

Manchester City    93
Ajax               83
Name: Top six league winners, dtype: int64

In [15]:
t6_pts.iloc[[0,1]]

Manchester City    93
Real Madrid        86
Name: Top six league winners, dtype: int64

In [16]:
t6_pts['Real Madrid':'PSG']

Real Madrid      86
Bayern Muchen    77
AC Milan         89
PSG              86
Name: Top six league winners, dtype: int64

In [17]:
t6_pts[0:1]

Manchester City    93
Name: Top six league winners, dtype: int64
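
The two slices above behave differently at the right-hand end: a label slice includes its stop label, while a positional slice excludes its stop position. The sketch below rebuilds the same data under the hypothetical name `pts` so that it runs on its own, and uses the explicit `.loc` and `.iloc` accessors to make the two kinds of slicing easy to tell apart.

```python
import pandas as pd

# Same values and labels as t6_pts; the name `pts` is only for this sketch.
pts = pd.Series(
    [93, 86, 77, 89, 86, 83],
    index=['Manchester City', 'Real Madrid', 'Bayern Muchen',
           'AC Milan', 'PSG', 'Ajax'],
)

# Label-based slice: both endpoints are included.
by_label = pts.loc['Real Madrid':'PSG']

# Position-based slice: the stop position (4) is excluded.
by_position = pts.iloc[1:4]

print(len(by_label))     # 4 rows: Real Madrid through PSG
print(len(by_position))  # 3 rows: positions 1, 2 and 3
```

Writing `.loc` or `.iloc` explicitly, rather than plain square brackets, removes any doubt about whether a slice is read as labels or as positions.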

### Conditional Boolean

In [18]:
# Compare every value with 70; the result is a Boolean Series
t6_pts > 70


Manchester City    True
Real Madrid        True
Bayern Muchen      True
AC Milan           True
PSG                True
Ajax               True
Name: Top six league winners, dtype: bool

In [19]:
t6_pts.mean()


85.66666666666667

In [20]:
t6_pts[t6_pts>t6_pts.mean()]

Manchester City    93
Real Madrid        86
AC Milan           89
PSG                86
Name: Top six league winners, dtype: int64

In [21]:
t6_pts.std()

5.428320796219276

In [22]:
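# Note: the second condition is implied by the first, so this keeps every value above mean - std/2
t6_pts[(t6_pts > t6_pts.mean() - t6_pts.std() / 2) | (t6_pts > t6_pts.mean() + t6_pts.std() / 2)]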

Manchester City    93
Real Madrid        86
AC Milan           89
PSG                86
Ajax               83
Name: Top six league winners, dtype: int64
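
The filter in the previous cell joins two `>` comparisons with `|`, so only the looser of the two conditions matters. If the goal was to separate values far from the mean from values close to it (an assumption; the notebook does not state the intent), the masks could be combined as in the sketch below. The names `pts`, `lower`, `upper`, `outside` and `inside` are hypothetical and exist only for this example.

```python
import pandas as pd

# Same values and labels as t6_pts; the name `pts` is only for this sketch.
pts = pd.Series(
    [93, 86, 77, 89, 86, 83],
    index=['Manchester City', 'Real Madrid', 'Bayern Muchen',
           'AC Milan', 'PSG', 'Ajax'],
)

# Band of half a standard deviation on either side of the mean.
half_std = pts.std() / 2
lower = pts.mean() - half_std
upper = pts.mean() + half_std

# Outside the band: below the lower bound OR above the upper bound.
outside = pts[(pts < lower) | (pts > upper)]

# Inside the band: at least the lower bound AND at most the upper bound.
inside = pts[(pts >= lower) & (pts <= upper)]

print(outside)
print(inside)
```

Each comparison needs its own parentheses because `|` and `&` bind more tightly than `<` and `>` in Python.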

### Operations and methods

In [23]:
t6_pts*100

Manchester City    9300
Real Madrid        8600
Bayern Muchen      7700
AC Milan           8900
PSG                8600
Ajax               8300
Name: Top six league winners, dtype: int64

In [24]:
np.log(t6_pts)

Manchester City    4.532599
Real Madrid        4.454347
Bayern Muchen      4.343805
AC Milan           4.488636
PSG                4.454347
Ajax               4.418841
Name: Top six league winners, dtype: float64

In [25]:
t6_pts['Real Madrid': 'PSG'].mean()

84.5

### Boolean Array

In [26]:
# Compare every value with 85; the result is a Boolean Series
t6_pts > 85

Manchester City     True
Real Madrid         True
Bayern Muchen      False
AC Milan            True
PSG                 True
Ajax               False
Name: Top six league winners, dtype: bool

In [27]:
t6_pts[t6_pts>85]

Manchester City    93
Real Madrid        86
AC Milan           89
PSG                86
Name: Top six league winners, dtype: int64

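### Another illustrative example

The data passed to `pd.Series` can be, among other things:

* an ndarray
* a scalar value (like 5)

The cells below build a Series from an ndarray.
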

In [30]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s

a    1.608945
b   -0.669065
c   -1.394876
d   -1.399377
e    0.662107
dtype: float64

In [31]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [32]:
pd.Series(np.random.randn(5))

0    0.745449
1   -0.645969
2   -1.282759
3    0.672198
4   -2.568809
dtype: float64
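
The cells above cover the ndarray case only. As a minimal sketch of the scalar case, the example below passes the value 5 together with an index; pandas repeats the scalar once for every label. The variable name `s_scalar` is hypothetical.

```python
import pandas as pd

# A scalar is repeated once per index label, so the index sets the length.
s_scalar = pd.Series(5, index=["a", "b", "c", "d", "e"])

print(s_scalar)
```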

### DataFrame creation

In [33]:
df = pd.DataFrame({
    'Points': [93, 66, 77, 89, 86, 83],
    'Country': [
        'England',
        'Spain',
        'Germany',
        'Italy',
        'France',
        'Nedtherlands',
    ],
    'Goal scored': [99, 80, 86, 89, 90, 98],
}, columns=['Points', 'Country', 'Goal scored'])

In [34]:
df

   Points       Country  Goal scored
0      93       England           99
1      66         Spain           80
2      77       Germany           86
3      89         Italy           89
4      86        France           90
5      83  Nedtherlands           98


In [36]:
df.index = [
    'Manchester City',
    'Real Madrid',
    'Bayern Muchen',
    'Ac Milan',
    'PSG',
    'Ajax',
]
df

                 Points       Country  Goal scored
Manchester City      93       England           99
Real Madrid          66         Spain           80
Bayern Muchen        77       Germany           86
Ac Milan             89         Italy           89
PSG                  86        France           90
Ajax                 83  Nedtherlands           98


In [37]:
df.columns

Index(['Points', 'Country', 'Goal scored'], dtype='object')

In [38]:
df.index

Index(['Manchester City', 'Real Madrid', 'Bayern Muchen', 'Ac Milan', 'PSG',
       'Ajax'],
      dtype='object')

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Manchester City to Ajax
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Points       6 non-null      int64 
 1   Country      6 non-null      object
 2   Goal scored  6 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 192.0+ bytes


In [41]:
df.size

18

In [43]:
df.shape

(6, 3)

In [44]:
df.describe()

          Points  Goal scored
count   6.000000     6.000000
mean   82.333333    90.333333
std     9.667816     7.229569
min    66.000000    80.000000
25%    78.500000    86.750000
50%    84.500000    89.500000
75%    88.250000    96.000000
max    93.000000    99.000000


In [45]:
df.dtypes

Points          int64
Country        object
Goal scored     int64
dtype: object

In [47]:
df.dtypes.value_counts()

int64     2
object    1
dtype: int64


### Indexing, Selection and Slicing

In [48]:
df

                 Points       Country  Goal scored
Manchester City      93       England           99
Real Madrid          66         Spain           80
Bayern Muchen        77       Germany           86
Ac Milan             89         Italy           89
PSG                  86        France           90
Ajax                 83  Nedtherlands           98


In [49]:
df.iloc[0]

Points              93
Country        England
Goal scored         99
Name: Manchester City, dtype: object

In [50]:
df.loc["Manchester City"]

Points              93
Country        England
Goal scored         99
Name: Manchester City, dtype: object

In [51]:
df["Country"]

Manchester City         England
Real Madrid               Spain
Bayern Muchen           Germany
Ac Milan                  Italy
PSG                      France
Ajax               Nedtherlands
Name: Country, dtype: object

In [52]:
df['Points']

Manchester City    93
Real Madrid        66
Bayern Muchen      77
Ac Milan           89
PSG                86
Ajax               83
Name: Points, dtype: int64

In [53]:
df['Country'].to_frame()

                      Country
Manchester City       England
Real Madrid             Spain
Bayern Muchen         Germany
Ac Milan                Italy
PSG                    France
Ajax             Nedtherlands


In [54]:
df[['Country', 'Goal scored']]

                      Country  Goal scored
Manchester City       England           99
Real Madrid             Spain           80
Bayern Muchen         Germany           86
Ac Milan                Italy           89
PSG                    France           90
Ajax             Nedtherlands           98


In [55]:
df[1:3]

               Points  Country  Goal scored
Real Madrid        66    Spain           80
Bayern Muchen      77  Germany           86


In [56]:
df.loc['Real Madrid': 'PSG']

               Points  Country  Goal scored
Real Madrid        66    Spain           80
Bayern Muchen      77  Germany           86
Ac Milan           89    Italy           89
PSG                86   France           90


In [57]:
df.loc['Real Madrid': 'PSG' ,'Goal scored']

Real Madrid      80
Bayern Muchen    86
Ac Milan         89
PSG              90
Name: Goal scored, dtype: int64

In [58]:
df.iloc[[0, 1, -1]]

                 Points       Country  Goal scored
Manchester City      93       England           99
Real Madrid          66         Spain           80
Ajax                 83  Nedtherlands           98


In [59]:
df.iloc[0:3, 2]

Manchester City    99
Real Madrid        80
Bayern Muchen      86
Name: Goal scored, dtype: int64
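
The selections above take either a slice or a single column. As a hedged sketch, the example below shows that `.loc` and `.iloc` also accept a list for the rows and a list for the columns at the same time. It rebuilds a smaller frame without the `Country` column, and the names `by_label` and `by_position` are hypothetical.

```python
import pandas as pd

# Numeric columns of the frame used above, rebuilt so the sketch runs on its own.
df = pd.DataFrame(
    {
        'Points': [93, 66, 77, 89, 86, 83],
        'Goal scored': [99, 80, 86, 89, 90, 98],
    },
    index=['Manchester City', 'Real Madrid', 'Bayern Muchen',
           'Ac Milan', 'PSG', 'Ajax'],
)

# By label: a list of row labels and a list of column labels.
by_label = df.loc[['PSG', 'Ajax'], ['Points', 'Goal scored']]

# By position: the last two rows and the columns at positions 0 and 1.
by_position = df.iloc[-2:, [0, 1]]

# Both selections pick out the same rows and columns.
print(by_label.equals(by_position))
```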