### Using Pandas to Get Familiar With Your Data


***The first step in any machine learning project is to familiarize yourself with the data. You'll use the pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as `pd`. We do this with the command `import pandas as pd`.***

What kind of data does pandas handle?

In [21]:
#import modules
import pandas as pd
import numpy as np

In [22]:
# Check the installed pandas version
print(pd.__version__)

1.2.4


In [23]:
# List every name (class, function, submodule) the pandas module exposes
dir(pd)

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Flags',
 'Float32Dtype',
 'Float64Dtype',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseDtype',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_dev',
 '_lib',
 '_libs',
 '_np_version_under1p17',
 '_np_version_under1p18',
 '_testing'

#### Pandas has two core data structures:
1. Series
2. DataFrame
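
As a rough orientation before the Series walkthrough, the sketch below shows how the two structures relate: a Series is one labeled column of values, and a DataFrame is a table whose columns are Series sharing one index. The `league` column and the variable names are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# A Series: one-dimensional values with labels.
points = pd.Series([93, 86], index=["Manchester City", "Real Madrid"], name="points")

# A DataFrame: two-dimensional, several columns sharing the same index.
# The 'league' column is added here purely for illustration.
df = pd.DataFrame({"points": points, "league": ["Premier League", "La Liga"]})

print(points.ndim, df.ndim)   # 1 2
print(df.shape)               # (2, 2)

# Selecting a single column of a DataFrame gives back a Series.
print(isinstance(df["points"], pd.Series))  # True
```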

## Pandas Series

* A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
* The axis labels are collectively referred to as the index.
* The basic method to create a Series is to call `pd.Series(data, index=index)`.
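
A minimal sketch of that constructor, using made-up values and labels (`data`, `index`, and the letters below are placeholders, not data from this notebook). It also shows two common variations: building from a dict, and omitting the index.

```python
import pandas as pd

# Hypothetical values and labels, used only to illustrate the constructor.
data = [10, 20, 30]
index = ["a", "b", "c"]

# General form: the values plus one label per value.
s = pd.Series(data, index=index)

# A dict also works: keys become the index, values become the data.
s_from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# Without an explicit index, pandas assigns positions 0, 1, 2, ...
s_default = pd.Series(data)

print(s.equals(s_from_dict))   # True
print(list(s_default.index))   # [0, 1, 2]
```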

***Here is an illustrative example: the total points obtained by the top six 2021/2022 European league winners.***


In [24]:
# Create a Series named t6_pts that stores the points obtained
t6_pts = pd.Series([93,86,77,89,86,83])
t6_pts

0    93
1    86
2    77
3    89
4    86
5    83
dtype: int64

In [25]:
# Give the Series a name
t6_pts.name = 'Top six league winners'
t6_pts

0    93
1    86
2    77
3    89
4    86
5    83
Name: Top six league winners, dtype: int64

In [26]:
#Displaying the datatype and values
t6_pts.dtype


dtype('int64')

In [27]:
t6_pts.values

array([93, 86, 77, 89, 86, 83])

In [28]:
# Display the first element (label 0 in the default integer index)
t6_pts[0]


93

In [29]:
t6_pts.index

RangeIndex(start=0, stop=6, step=1)

In [30]:
# Replace the default integer index with the team each value belongs to

t6_pts.index = [
    'Manchester City',
    'Real Madrid',
    'Bayern Muchen',
    'AC Milan',
    'PSG',
    'Ajax',
]

t6_pts

Manchester City    93
Real Madrid        86
Bayern Muchen      77
AC Milan           89
PSG                86
Ajax               83
Name: Top six league winners, dtype: int64

In [31]:
# Alternatively, pass the values, index, and name in a single pd.Series(data, index=index) call

pd.Series(
    [93, 86, 77, 89, 86, 83],
    index=['Manchester City','Real Madrid', 'Bayern Muchen', 'AC Milan', 'PSG', 'Ajax'],
    name='Top six league winners')

Manchester City    93
Real Madrid        86
Bayern Muchen      77
AC Milan           89
PSG                86
Ajax               83
Name: Top six league winners, dtype: int64

In [32]:
# Display part of the data by passing only the labels to keep
pd.Series(t6_pts, index=['Real Madrid', 'Bayern Muchen', 'AC Milan'])


Real Madrid      86
Bayern Muchen    77
AC Milan         89
Name: Top six league winners, dtype: int64
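
One thing worth knowing about this pattern: passing an existing Series together with an `index` selects values by label, so a label that is not in the original Series comes back as a missing value rather than raising an error. The sketch below uses a three-team subset of the data, and `'Chelsea'` is a hypothetical label chosen only because it is absent.

```python
import pandas as pd

pts = pd.Series(
    [93, 86, 77],
    index=["Manchester City", "Real Madrid", "Bayern Muchen"],
)

# Every requested label exists, so this is a plain selection.
subset = pd.Series(pts, index=["Real Madrid", "Bayern Muchen"])

# 'Chelsea' is not in pts (hypothetical label): its value becomes NaN,
# and the integers are upcast to float64 to hold the missing value.
with_missing = pd.Series(pts, index=["Real Madrid", "Chelsea"])

print(subset.tolist())               # [86, 77]
print(with_missing.isna().tolist())  # [False, True]
print(with_missing.dtype)            # float64
```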

### Indexing 

In [33]:
# Display the Series
t6_pts

Manchester City    93
Real Madrid        86
Bayern Muchen      77
AC Milan           89
PSG                86
Ajax               83
Name: Top six league winners, dtype: int64

In [34]:
# Select several elements by passing a list of labels
t6_pts[['Manchester City','Ajax']]

Manchester City    93
Ajax               83
Name: Top six league winners, dtype: int64

In [35]:
# Select elements by integer position with iloc
t6_pts.iloc[[0,1]]

Manchester City    93
Real Madrid        86
Name: Top six league winners, dtype: int64

In [17]:
# Slice by label: both endpoints are included
t6_pts['Real Madrid':'PSG']

Real Madrid      86
Bayern Muchen    77
AC Milan         89
PSG              86
Name: Top six league winners, dtype: int64

In [36]:
# Slice by position: the stop position is excluded (t6_pts.iloc[0:1] is the explicit form)
t6_pts[0:1]

Manchester City    93
Name: Top six league winners, dtype: int64
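
The cells above mix label-based and position-based access. A compact way to keep them apart is to use `.loc` for labels and `.iloc` for positions explicitly. The sketch below uses a four-team subset of the same data and is an illustration, not a cell from the original notebook.

```python
import pandas as pd

pts = pd.Series(
    [93, 86, 77, 89],
    index=["Manchester City", "Real Madrid", "Bayern Muchen", "AC Milan"],
)

# Label-based slice: both endpoints are included.
by_label = pts.loc["Real Madrid":"AC Milan"]

# Position-based slice: the stop position is excluded, as with Python lists.
by_position = pts.iloc[1:3]

print(by_label.tolist())     # [86, 77, 89]
print(by_position.tolist())  # [86, 77]

# A single element, reached both ways.
print(pts.iloc[0] == pts.loc["Manchester City"])  # True
```

Being explicit avoids ambiguity when an index itself contains integers, where a bare `s[0]` could mean either a label or a position.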

### Conditional Boolean

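In [47]:
t6_pts
t6_pts>70

Manchester City    True
Real Madrid        True
Bayern Muchen      True
AC Milan           True
PSG                True
Ajax               True
Name: Top six league winners, dtype: bool

In [50]:
t6_pts.mean()

85.66666666666667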

In [49]:
# Keep only the teams whose points are above the mean
t6_pts[t6_pts>t6_pts.mean()]

Manchester City    93
Real Madrid        86
AC Milan           89
PSG                86
Name: Top six league winners, dtype: int64

In [51]:
t6_pts.std()

5.428320796219276

In [52]:
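# Combine two conditions with | (or). The second condition implies the first here,
# so the result is the same as t6_pts > t6_pts.mean() - t6_pts.std() / 2
t6_pts[(t6_pts > t6_pts.mean() - t6_pts.std() / 2) | (t6_pts> t6_pts.mean() + t6_pts.std() / 2)]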

Manchester City    93
Real Madrid        86
AC Milan           89
PSG                86
Ajax               83
Name: Top six league winners, dtype: int64
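
Boolean conditions can be combined with `|` (or), `&` (and), and `~` (not), with each comparison wrapped in parentheses. As a variation on the cell above, and not the notebook's own filter, the sketch below selects the teams that lie more than half a standard deviation away from the mean on either side. The names `above`, `below`, `outside`, and `inside` are illustrative.

```python
import pandas as pd

pts = pd.Series(
    [93, 86, 77, 89, 86, 83],
    index=["Manchester City", "Real Madrid", "Bayern Muchen",
           "AC Milan", "PSG", "Ajax"],
)

mean = pts.mean()
half_std = pts.std() / 2

above = pts > mean + half_std   # well above average
below = pts < mean - half_std   # well below average

# Either condition holds: teams far from the mean in either direction.
outside = pts[above | below]

# Negate the combined mask: teams close to the mean.
inside = pts[~(above | below)]

print(outside.tolist())  # [93, 77, 89]
print(inside.tolist())   # [86, 86, 83]
```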

### Operations and methods

In [53]:
# Arithmetic is applied element by element and returns a new Series
t6_pts*100

Manchester City    9300
Real Madrid        8600
Bayern Muchen      7700
AC Milan           8900
PSG                8600
Ajax               8300
Name: Top six league winners, dtype: int64

In [54]:
# NumPy functions also work element by element and keep the index
np.log(t6_pts)

Manchester City    4.532599
Real Madrid        4.454347
Bayern Muchen      4.343805
AC Milan           4.488636
PSG                4.454347
Ajax               4.418841
Name: Top six league winners, dtype: float64

In [55]:
# Mean of a label-based slice (Real Madrid through PSG, inclusive)
t6_pts['Real Madrid': 'PSG'].mean()

84.5
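
Two properties of these operations are easy to miss: they return a new Series and leave the original unchanged, and arithmetic between two Series matches values by label, not by position. The sketch below illustrates both. The `bonus` Series is hypothetical and exists only to demonstrate alignment.

```python
import pandas as pd

pts = pd.Series(
    [93, 86, 77],
    index=["Manchester City", "Real Madrid", "Bayern Muchen"],
)

# Element-wise arithmetic returns a new Series; pts itself is not modified.
scaled = pts * 100
print(pts.tolist())     # [93, 86, 77]
print(scaled.tolist())  # [9300, 8600, 7700]

# Hypothetical second Series with labels in a different order and one missing.
bonus = pd.Series([1, 2], index=["Real Madrid", "Manchester City"])

# Values are matched by label; a label missing from either side gives NaN.
total = pts + bonus
print(total.loc["Manchester City"])          # 95.0
print(total.loc["Real Madrid"])              # 87.0
print(pd.isna(total.loc["Bayern Muchen"]))   # True
```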

### Boolean Array

In [56]:
# A comparison returns a boolean Series: one True/False per element
t6_pts>85

Manchester City     True
Real Madrid         True
Bayern Muchen      False
AC Milan            True
PSG                 True
Ajax               False
Name: Top six league winners, dtype: bool

In [57]:
# Use the boolean Series as a mask: only elements marked True are kept
t6_pts[t6_pts>85]

Manchester City    93
Real Madrid        86
AC Milan           89
PSG                86
Name: Top six league winners, dtype: int64
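
A boolean Series is useful beyond filtering. Because `True` counts as 1 and `False` as 0, it can be summed to count matches, and `.any()` / `.all()` answer whether some or every element satisfies the condition. A short sketch on the same data, with `mask` as an illustrative name:

```python
import pandas as pd

pts = pd.Series(
    [93, 86, 77, 89, 86, 83],
    index=["Manchester City", "Real Madrid", "Bayern Muchen",
           "AC Milan", "PSG", "Ajax"],
)

mask = pts > 85

# True counts as 1, so the sum is the number of teams above 85 points.
print(int(mask.sum()))   # 4

# Does at least one team pass? Do all of them?
print(bool(mask.any()))  # True
print(bool(mask.all()))  # False

# The labels of the matching teams.
print(list(pts[mask].index))
```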