# Pandas
## 1. Intro to Pandas
Pandas is a powerful data science package for handling and manipulating data. It is widely used for data munging, cleaning, and preprocessing of data from various sources such as Excel files, CSV files, databases, and wherever else data is located. Data munging, also called data wrangling, is the process of transforming and mapping data from one "raw" form into another format with the intent of making it more appropriate and valuable for downstream purposes such as analytics.
* Pandas is built upon NumPy so most of the features in NumPy are also available in Pandas.
* The two most important objects in Pandas are Series and DataFrame.
* First of all, we have to import Pandas like this: import pandas as pd.
* To check the version of Pandas: `print(pd.__version__)`
* In a Series we can supply our own indices (plural of index). Even then, it keeps default positions starting from 0 in the background, as a list or array does, for implicit indexing.
* To get only the values, use Variable.values; its type is a NumPy array.
* To get only the indices, use Variable.index; its type is a Pandas Index object.
* Printing the whole Series shows something like a dictionary, where each index 'key' has its own value; its type is a Pandas Series object.

In [1]:
import pandas as pd

In [2]:
print(pd.__version__)

2.0.3


In [3]:
A = pd.Series([2,3,4,5],index=['a','b','c','d'])
A.values

array([2, 3, 4, 5], dtype=int64)

In [4]:
type(A.values)

numpy.ndarray

In [5]:
A.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [6]:
type(A.index)

pandas.core.indexes.base.Index

In [7]:
A

a    2
b    3
c    4
d    5
dtype: int64

In [8]:
type(A)

pandas.core.series.Series

## 2. Series
An "array" in NumPy is similar to a 'list' in Python, whereas a "series" in Pandas is more like a 'dictionary' in Python.
* We can access an element of a Pandas Series just as we would with a dictionary, by specifying the index: Variable[index].
* Explicit slicing uses the index labels ('keys'); the final index is __included__ (inclusive).
* Implicit slicing uses integer positions, as we saw in NumPy; the final index is __not included__ (exclusive).
* There are two ways to create a Series:
  - The first, as we have already done: Variable = pd.Series(values, index=labels).
  - The second way is to create a dictionary and then pass it to the Series: Variable = pd.Series(dictionary).
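
The dictionary analogy above can be sketched a little further. This is a minimal illustration, assuming a throwaway Series named `demo` (a hypothetical name, chosen so that the variable `A` used in the cells is left untouched):

```python
import pandas as pd

# Hypothetical Series with the same values and labels as A in the cells.
demo = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])

# Membership tests look at the labels (keys), not at the values.
has_b = 'b' in demo
has_2 = 2 in demo          # 2 is a value, not a label

# keys() returns the index; items() yields (label, value) pairs.
labels = list(demo.keys())
pairs = list(demo.items())

# Assigning to a new label adds an entry, as with a dict.
demo['e'] = 6
```

As with a dictionary, the lookup `demo['e']` then returns the newly added value.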

In [9]:
A['a']

2

In [10]:
A['a':'c']  # explicit indices

a    2
b    3
c    4
dtype: int64

In [11]:
A[0:2]  # implicit indices

a    2
b    3
dtype: int64

In [12]:
D ={'Name':'Nadir','Surname':"Havertz",'A':1,'B':25}
S = pd.Series(D)
S

Name         Nadir
Surname    Havertz
A                1
B               25
dtype: object

In [13]:
type(S)

pandas.core.series.Series

In [14]:
grads_dict = {'A':4,'B':3.5,'C':3,'D':2.5}
grads = pd.Series(grads_dict)
grads

A    4.0
B    3.5
C    3.0
D    2.5
dtype: float64

In [15]:
marks_dict = {'A':85,'B':75,'C':65,'D':55}
marks = pd.Series(marks_dict)
marks

A    85
B    75
C    65
D    55
dtype: int64

In [16]:
marks['A']

85

In [17]:
marks[0:2]

A    85
B    75
dtype: int64

## 3. DataFrame
A DataFrame can be seen as an extension of a Series to more than one dimension: a dictionary where each key holds a collection of values, in the form of an array or a dictionary.
* A DataFrame also takes a dictionary, but this time each key has multiple values (an array, a dictionary, or any other data structure of values), which are displayed in table form.
* To create a DataFrame, just like a Series: pd.DataFrame(D).
* We can apply some NumPy operations, such as:
  - Transpose: Variable.T switches between rows and columns (changes the table's layout).
* If a key within the values of one column is not present among the keys of the other column(s), Pandas fills the missing cells with NaN. There are multiple methods for dealing with NaN values in Pandas:
  - Variable.fillna(Number): replaces NaN values with that specific number; most of the time it's 0.
  - Variable.dropna(): drops the rows that contain NaN values by default (pass axis=1 to drop columns instead); it's often not practical since it results in data loss.
  #### Note: scikit-learn has methods for handling missing data (NaN values), for example by replacing each missing value with the average of its column or by estimating it through regression.
* To get the columns we can use Variable.columns. The same goes for values and indices (Variable.values, Variable.index), and even for specific values: Variable.values[row, column].
* Just like a dict, a DataFrame is mutable: we can add a column and specify its values, update a value (slightly differently from a dict, by using .at or .loc: Variable.at[Variable.index[Position], 'ColumnName'] = NewValue), and even delete a column.
* Masking (index array) and Boolean arrays return a copy, so changing the new DataFrame does not change the original, just like in NumPy.
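
The mutation steps listed above can be sketched in one place. This is only an illustration under assumptions: the name `df`, the column `Passed`, and the new values are hypothetical, and the explicit `.copy()` is added here to make the copy behaviour unambiguous.

```python
import pandas as pd

# Rebuild a small table like the one used in the cells.
marks = pd.Series({'A': 85, 'B': 75, 'C': 65, 'D': 55})
grads = pd.Series({'A': 4, 'B': 3.5, 'C': 3, 'D': 2.5})
df = pd.DataFrame({'Marks': marks, 'Grades': grads})

# Add a column by assigning to a new column name.
df['Passed'] = df['Marks'] >= 60

# Update a single cell by row label and column name.
df.at['D', 'Marks'] = 58
# .loc works for single cells too and also accepts slices or lists of labels.
df.loc['C', 'Grades'] = 3.2

# Delete a column.
del df['Passed']

# A Boolean mask selects rows into a new object; the explicit copy()
# makes it clear that later changes do not touch df.
high = df[df['Marks'] > 70].copy()
high.at['A', 'Marks'] = 90
```

After this runs, `high` holds 90 for row A while `df` still holds 85, which matches the copy behaviour described in the last bullet.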

In [18]:
grads

A    4.0
B    3.5
C    3.0
D    2.5
dtype: float64

In [19]:
marks

A    85
B    75
C    65
D    55
dtype: int64

In [20]:
DF = pd.DataFrame({'Marks':marks,'Grades':grads})
DF

   Marks  Grades
A     85     4.0
B     75     3.5
C     65     3.0
D     55     2.5


In [21]:
DF.T

           A     B     C     D
Marks   85.0  75.0  65.0  55.0
Grades   4.0   3.5   3.0   2.5


In [22]:
T = pd.DataFrame({'Marks':{'A':85,'B':75,'C':40},'Grades':{'D':29,'E':23,'F':7}})
T

   Marks  Grades
A   85.0     NaN
B   75.0     NaN
C   40.0     NaN
D    NaN    29.0
E    NaN    23.0
F    NaN     7.0


In [23]:
DSX = pd.DataFrame([{'a':1,'b':4},{'b':-3,'c':9}])
DSX

     a  b    c
0  1.0  4  NaN
1  NaN -3  9.0


In [24]:
DSX.fillna(0)

     a  b    c
0  1.0  4  0.0
1  0.0 -3  9.0


In [25]:
DSX.dropna()

Empty DataFrame
Columns: [a, b, c]
Index: []
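
The empty result above comes from every row containing at least one NaN. A short sketch of a few gentler alternatives follows, using Pandas only (it is not the scikit-learn approach mentioned in the note); the variable names are hypothetical.

```python
import pandas as pd

# Same data as DSX in the cells: each row is missing one key.
dsx = pd.DataFrame([{'a': 1, 'b': 4}, {'b': -3, 'c': 9}])

# Default (axis=0): drop every row that contains any NaN.
rows_kept = dsx.dropna()

# axis=1: drop every column that contains any NaN instead.
cols_kept = dsx.dropna(axis=1)

# how='all': drop a row only if all of its values are NaN.
not_all_nan = dsx.dropna(how='all')

# Fill each NaN with the mean of its own column.
mean_filled = dsx.fillna(dsx.mean())
```

Here `rows_kept` is empty, `cols_kept` keeps only column `b`, and `not_all_nan` keeps both rows, so the choice of method decides how much data is lost.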


In [26]:
DF

   Marks  Grades
A     85     4.0
B     75     3.5
C     65     3.0
D     55     2.5


In [27]:
DF.values

array([[85. ,  4. ],
       [75. ,  3.5],
       [65. ,  3. ],
       [55. ,  2.5]])

In [28]:
DF.values[2,0]

65.0

In [29]:
DF.index

Index(['A', 'B', 'C', 'D'], dtype='object')

In [30]:
DF.columns

Index(['Marks', 'Grades'], dtype='object')

In [31]:
DF['ScaleMarks']=round(100*(DF['Marks']/90),1)
DF

   Marks  Grades  ScaleMarks
A     85     4.0        94.4
B     75     3.5        83.3
C     65     3.0        72.2
D     55     2.5        61.1


In [32]:
del DF['ScaleMarks']  # delete the column
DF

   Marks  Grades
A     85     4.0
B     75     3.5
C     65     3.0
D     55     2.5


In [33]:
G = DF[DF['Marks']>70]  # Boolean array (mask)
G

   Marks  Grades
A     85     4.0
B     75     3.5


In [34]:
G.at[G.index[0],'Marks']=90
G

   Marks  Grades
A     90     4.0
B     75     3.5


In [35]:
DF  # no change: still 85, not 90, because Boolean-array masking (like fancy indexing) returns a copy

   Marks  Grades
A     85     4.0
B     75     3.5
C     65     3.0
D     55     2.5


## 4. Indexing & slicing
As already mentioned, we can supply our own indices and then use them for explicit slicing. But if those indices are integers, confusion arises: Variable[1] uses the explicit index, while Variable[1:3] uses implicit positions. To resolve this, we can use Variable.loc[index1:index2] for explicit indices and Variable.iloc[index1:index2] for implicit indices. With integer indices, plain slicing is already implicit, so iloc mainly serves to make the intent clear.
* We can also use loc and iloc with DataFrames to access values from rows and columns, and even to reverse the rows, the columns, or both. When the explicit indices are letters, slicing is unambiguous without loc; loc is needed only when they are numbers.

In [36]:
data = pd.Series(['a','b','c'], index=[1,3,5])

In [37]:
data[1]

'a'

In [38]:
data[1:3]  # I wanted explicit indices but got implicit ones

3    b
5    c
dtype: object

In [39]:
data.loc[1:3]  # explicit indices thanks to loc

1    a
3    b
dtype: object

In [40]:
data.iloc[1:3]  # implicit indices using iloc

3    b
5    c
dtype: object

In [41]:
DF

   Marks  Grades
A     85     4.0
B     75     3.5
C     65     3.0
D     55     2.5


In [42]:
DF.iloc[2,:] #values from rows using implicit index

Marks     65.0
Grades     3.0
Name: C, dtype: float64

In [43]:
DF.loc['C',:] #values from rows using explicit index

Marks     65.0
Grades     3.0
Name: C, dtype: float64

In [44]:
DF.iloc[:,1] #values from columns using implicit index

A    4.0
B    3.5
C    3.0
D    2.5
Name: Grades, dtype: float64

In [45]:
DF.loc[:,"Grades"] #values from columns using explicit index

A    4.0
B    3.5
C    3.0
D    2.5
Name: Grades, dtype: float64

In [46]:
DF.iloc[::-1,:] #reverse rows

   Marks  Grades
D     55     2.5
C     65     3.0
B     75     3.5
A     85     4.0


In [47]:
DF.iloc[:,::-1] #reverse columns

   Grades  Marks
A     4.0     85
B     3.5     75
C     3.0     65
D     2.5     55


In [48]:
DF.iloc[::-1,::-1] #reverse both rows & columns

   Grades  Marks
D     2.5     55
C     3.0     65
B     3.5     75
A     4.0     85
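
The cells above select whole rows or whole columns. As a small complementary sketch, loc and iloc can also combine a row selector and a column selector; the names below are hypothetical and the table is rebuilt so the block runs on its own.

```python
import pandas as pd

# Same table as DF in the cells, rebuilt here for self-containment.
df = pd.DataFrame(
    {'Marks': [85, 75, 65, 55], 'Grades': [4.0, 3.5, 3.0, 2.5]},
    index=['A', 'B', 'C', 'D'],
)

# A single cell: by labels with loc, by positions with iloc.
by_label = df.loc['C', 'Marks']
by_position = df.iloc[2, 0]

# Label slices include the end label; position slices exclude the end.
label_slice = df.loc['B':'C', 'Marks']
position_slice = df.iloc[1:3, 0]

# Lists of labels select several rows and columns at once.
picked = df.loc[['A', 'D'], ['Grades']]
```

Both slices select rows B and C: the label slice because 'C' is included, the position slice because position 3 is excluded.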


## 5. Working on data files
* To import a file, use Variable = pd.read_TypeOfFile('path'): put the path in ' ' or " ", use forward slashes /, and don't forget the file's extension.
* Here are some useful functions:
  - head() & head(N): shows the first 5 records by default, and we can specify the number of records by passing a number.
  - tail() & tail(N): shows the last 5 records by default, and we can also specify the number of records by passing a number.
  - If we only type the variable alone we get all records (Pandas truncates the display of long tables).
  - Variable.columns: to get all columns of a data frame.
  - Variable.dtypes: to get the data type of each column.
  - drop(['column1','column2',...], axis=1, inplace=True or False): we choose the columns we want to drop; axis=1 means the drop applies to columns, and inplace=True makes the drop take effect on the Variable itself, whereas with False the Variable is left unchanged and a new data frame is returned.
  - rename(columns={'OldColumn':'NewColumn'}, inplace=True or False): to rename one or more columns using a dictionary.
  - Variable['Column'] = pd.to_datetime(Variable['Column']): to convert a column's values to datetime.
  - describe(): to get summary statistics of a data frame.
  - describe(include='object'): to get statistics on non-numeric (object) columns.
  - info(): to get info about the columns of a data frame, the type of the values of each column, and the non-null count of each column (from which the NaN count follows).
  - groupby(['Column1','Column2'])[['Column3','Column4','Column5']].sum().reset_index(): groups by the columns in the first list, sums the columns in the second list, and starts a new index.
  - sum(): useful especially with groupby.
  - unique(): to get the unique values of a column.
  - Filtering on columns, we can use:
    * df.ColumnName.
    * df['ColumnName'], required if the name has spaces.
    * df[['Column1','Column2']] for multiple columns.
  - Filtering on rows, we can use:
    * df[df['ColumnName']=='Value'].
    * df[(df['Column1']=='Value') & (df['Column2']==Value)]
  - iloc for implicit indexing using numbers (if the indices are auto-generated, the explicit indices are the same numbers):
    * df.iloc[Number]: to get the value of each column for that specific row.
    * df.iloc[Row,Column]: for a specific value, useful for updating.
    * df.iloc[StartRow:EndRow]: for a range of rows.
  - loc for explicit indexing using labels:
    * Column.loc['Value']
  - df['NewColumn'] = df['ExistingColumn'].apply(lambda x: 1 if x=='Value' else 0): for condition-based updating using apply.
  - df.to_TypeOfFile('FileName.extension'): this writes a file of that type to the same directory as the Jupyter Notebook's: C:\Users\Nadir.