### Prepping Your Data with Pandas
![image.png](attachment:c37e6409-4849-454c-bcc5-b2217b4d210c.png)

Learning objectives:
- Understand the building blocks of Pandas:
    - Series,
    - DataFrames, and
    - Indexes.
- Understand the functions in this library that are used to tidy, cleanse, merge, and aggregate data.
- Develop the skills necessary for preparing your data.

In [1]:
import pandas as pd 
pd.__version__

'1.4.0'

In [139]:
# import pandas as pd
# print(dir(pd),end=" ")

In [138]:
import pandas as pd
data=pd.read_csv("https://raw.githubusercontent.com/DataRepo2019/Data-files/master/subset-covid-data.csv")
data.head()

In [34]:
data.tail()

Unnamed: 0,country,continent,date,day,month,year,cases,deaths,country_code,population
201,Venezuela,America,2020-04-12,12,4,2020,0,0,VEN,28870195.0
202,Vietnam,Asia,2020-04-12,12,4,2020,4,0,VNM,95540395.0
203,Yemen,Asia,2020-04-12,12,4,2020,0,0,YEM,28498687.0
204,Zambia,Africa,2020-04-12,12,4,2020,0,0,ZMB,17351822.0
205,Zimbabwe,Africa,2020-04-12,12,4,2020,3,0,ZWE,14439018.0


In [36]:
type(data['country'])

pandas.core.series.Series

In [37]:
import pandas as pd
type(pd.DataFrame())


pandas.core.frame.DataFrame

In [38]:
type(pd.Series())


pandas.core.series.Series

### Pandas at a glance

- Wes McKinney developed the Pandas library in 2008.
- The name (Pandas) comes from the term “Panel Data” used in econometrics for analyzing time-series data.


### Pandas features:

1. Pandas provides features for labeling of data or indexing, which speeds up the retrieval of data.

2. Input and output support: Pandas provides options to read data from different file formats like 
    - JSON (JavaScript Object Notation),
    - CSV (Comma-Separated Values), 
    - Excel, and 
    - HDF5 (Hierarchical Data Format Version 5). 
    - It can also be used to write data into databases, web services, and so on.

3. Most of the data that is needed for analysis is not contained in a single source, and we often need to combine datasets to consolidate the data that we need for analysis. Again, Pandas comes to the rescue with tailor-made functions to combine data.

4. Speed and enhanced performance: 
    - Pandas is built on top of NumPy, and its performance-critical parts are written in Cython, which combines the convenience and ease of use of Python with the speed of the C language. This helps to optimize performance and reduce overhead.

5. Data visualization: 
    - To derive insights from the data and make it presentable to the audience, viewing data using visual means is crucial, and Pandas provides a lot of built-in visualization tools using Matplotlib as the base library.

6. Support for other libraries: 
    - Pandas integrates smoothly with other libraries like NumPy, Matplotlib, SciPy, and scikit-learn. Thus, we can perform other tasks like numerical computations, visualizations, statistical analysis, and machine learning in conjunction with data manipulation.

7. Grouping: 
    - Pandas provides support for the split-apply-combine methodology, whereby we can group our data into categories, apply separate functions on them, and combine the results.

8. Handling missing data, duplicates, and filler characters: 
    - Data often has missing values, duplicates, blank spaces, special characters (like $, &), and so on that may need to be removed or replaced. With the functions provided in Pandas, you can handle such anomalies with ease.

9. Mathematical operations:
    - Many numerical operations and computations can be performed in Pandas, with NumPy being used at the back end for this purpose.
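
As a rough illustration of the split-apply-combine idea from item 7, here is a minimal sketch. The `sales` table and its column names are invented for this example and are not part of the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 4, 6, 8],
})

# Split the rows by region, apply sum to each group, combine into one Series.
units_per_region = sales.groupby("region")["units"].sum()
print(units_per_region)
```

The result is a Series indexed by the group labels, so a single group total can be looked up by its name.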

### Technical requirements

- If you have not already installed Pandas, open the Anaconda Prompt or a terminal and enter the following command:
- ```pip install pandas```

- Then import the library in your notebook: ```import pandas as pd```
- Here, pd is the standard shorthand name (alias) for Pandas.


In [145]:
import pandas as pd
print(dir(pd),end="")

['BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Float64Index', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int64Index', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt64Index', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__deprecated_num_index_names', '__dir__', '__doc__', '__docformat__', '__file__', '__getattr__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_config', '_is_numpy_dev', '_libs', '_testing', '_typing', '_version', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'concat

### Building blocks of Pandas
- The Series and DataFrame classes are the underlying data structures in Pandas.
- In a nutshell, a Series is like a column (it has only one dimension),
- and a DataFrame (which has two dimensions) is like a table or a spreadsheet with rows and columns.
- Each value stored in a Series or a DataFrame has a label or an index attached to it, which speeds up retrieval and access to data. 
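
The relationship between the two structures can be sketched as follows. The names `ages` and `people` and their values are made up for illustration.

```python
import pandas as pd

# A Series: one dimension, values plus index labels.
ages = pd.Series([22, 24, 20], index=["a", "b", "c"])

# A DataFrame: two dimensions; each column is itself a Series.
people = pd.DataFrame(
    {"age": [22, 24, 20], "height_cm": [170, 165, 180]},
    index=["a", "b", "c"],
)

print(ages["b"])               # look up a value by its label
print(type(people["age"]))     # a single column is a Series
print(ages.ndim, people.ndim)  # number of dimensions of each object
```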

### Creating a Series object
- The Series is a one-dimensional object, with a set of values and their associated indexes.

### Using a scalar value


In [146]:
#creating a Series using a scalar value:
import pandas as pd
x=pd.Series(4)
print(x)
print(type(x))
#Creating a simple series with just one value. Here, 0 is the index label, and 4 is the value the Series object contains.

0    4
dtype: int64
<class 'pandas.core.series.Series'>


In [148]:
a=10
print(type(a))

<class 'int'>


In [157]:
# Using a list
ls=pd.Series([2,3,1,2,566,76])
#Creating a series from a list of six values. 0,1,2,3,4,5 are the
#autogenerated index labels.
print(ls)
print(type(ls))

0      2
1      3
2      1
3      2
4    566
5     76
dtype: int64
<class 'pandas.core.series.Series'>


In [152]:
#Using characters in a string
pd.Series(list('hello'),index =["a","b","c","d","e"])
#Creating a series by using each character in the string "hello" as a separate value in the Series.

a    h
b    e
c    l
d    l
e    o
dtype: object

In [166]:
#Using a dictionary
pd.Series({1:'India',2:'Japan',3:'Singapore'})
#The key/value pairs correspond to the index labels and values in the Series object.

1        India
2        Japan
3    Singapore
dtype: object

In [167]:
#Using a range
import numpy as np
pd.Series(np.arange(1,5))
#Using the NumPy arange function to create a series from a range of 4 numbers (1-4). Ensure that the
#NumPy library is also imported

0    1
1    2
2    3
3    4
dtype: int32

In [177]:
import numpy as np
data=np.genfromtxt("climate.txt",delimiter=',',skip_header=1)
data.shape

(10000, 3)

In [170]:
data[:,0]

array([25., 39., 59., ..., 99., 70., 92.])

In [1]:
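import numpy as np
import pandas as pd  # pd must be imported in the session before it is used
temp_data=pd.Series(data[:,0])  # a Series is one-dimensional, so pass a single column (here the first one)
type(temp_data)

pandas.core.series.Series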

In [174]:
#Using random numbers
pd.Series(np.random.normal(size=5))
#Creating a set of 5 random numbers using the np.random.normal function

0    0.664566
1   -0.196134
2    1.451272
3   -1.753180
4    1.983111
dtype: float64

In [175]:
#Creating a series with customized index labels
pd.Series([2,0,1,6],index=['a','b','c','d'])
#The list [2,0,1,6] specifies the values in the series, and the list for the index['a','b','c','d'] specifies the index labels

a    2
b    0
c    1
d    6
dtype: int64

You can create a Series object from a single (scalar) value, a list, a dictionary, a set of random numbers, or a range of numbers. The pd.Series function creates a Series
object (note that the letter “S” in “Series” is in uppercase; pd.series will not work). Use the index parameter if you want to customize the index.
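
A few of these points can be checked with a short sketch. The values and labels below are illustrative; the only Pandas call used is `pd.Series`.

```python
import pandas as pd

# A scalar combined with an index is repeated once per label.
repeated = pd.Series(2, index=range(5))
print(repeated)

# A dictionary supplies both the labels (keys) and the values.
capitals = pd.Series({"India": "New Delhi", "Japan": "Tokyo"})
print(capitals["Japan"])

# The class name is case-sensitive: the pandas module has no attribute "series".
print(hasattr(pd, "Series"), hasattr(pd, "series"))
```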

### Examining the properties of a Series
- In this section, we examine properties of a Series object such as the number of elements, its values, and its unique elements.

### Finding out the number of elements in a Series
- There are three ways of finding the number of elements a Series contains: 
     - using the size attribute, the len function, or the shape attribute


In [179]:
#series definition
x=pd.Series(np.arange(1,10))
#using the size attribute
print(x)
x.size

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int32


9

In [64]:
#builtin function 
len(x)

9

In [49]:
#The shape attribute returns a tuple of dimensions; for a Series it has a single entry, the number of elements
x.shape

(9,)

### Listing the values of the individual elements in a Series
- The values attribute returns a NumPy array containing the values of each item in the Series.

In [50]:
x.values

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [184]:
x.index

RangeIndex(start=0, stop=9, step=1)

In [183]:
temp_data.value_counts()  # distinct elements and how often each occurs

40.0    151
37.0    148
72.0    146
35.0    145
80.0    144
       ... 
63.0    103
83.0    101
25.0    101
47.0    100
34.0     98
Length: 81, dtype: int64

In [72]:
ixs=pd.Series([2,0,1,6],index=['a','b','c','d'])

In [185]:
temp_data.index

RangeIndex(start=0, stop=10000, step=1)

In [186]:
ixs.index

Index(['a', 'b', 'c', 'd'], dtype='object')

### Accessing the index of a Series
- The index of the Series can be accessed through the index attribute. 
- An index is an object with a data type and a set of values. 
- The default type for an index object is RangeIndex.


In [70]:
x.index

RangeIndex(start=0, stop=9, step=1)

In [71]:
temp_data.index

RangeIndex(start=0, stop=10000, step=1)
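
Beyond inspecting the index, its labels can be used to retrieve values. The following sketch reuses the `ixs` Series defined earlier and contrasts label-based access (`.loc`) with position-based access (`.iloc`).

```python
import pandas as pd

ixs = pd.Series([2, 0, 1, 6], index=["a", "b", "c", "d"])

by_label = ixs.loc["c"]     # label-based lookup
by_position = ixs.iloc[2]   # position-based lookup (third element)
labels = list(ixs.index)    # the index labels as a plain list

print(by_label, by_position, labels)
```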

### Obtaining the unique values in a Series and their count
- The value_counts() method is an important method. 
- When used with a Series object, it displays the unique values contained in this object and the count of each of these unique values. It is a common practice to use this method with a categorical variable, to get an idea of the distinct values it contains.


In [74]:

z=pd.Series(['a','b','a','c','d','b'])
z.value_counts()

a    2
b    2
c    1
d    1
dtype: int64

### Method chaining for a Series
- We can apply multiple methods to a series and apply them successively. 
- This is called method chaining and can be applied for both Series and DataFrame objects.
- Example:
    - Suppose we want to find out the number of times the two most frequent values (“a” and “b”) occur in the series “z” defined below. We can combine the value_counts method and the head method by chaining them.

In [188]:
z=pd.Series(['a','b','a','c','d','b'])
z.value_counts().head(2) # head takes an integer n; the default is n=5

a    2
b    2
dtype: int64

- If multiple methods need to be chained together and applied to a Series object, it is better to put each method on a separate line, with each line except the last ending with a backslash. 
- This makes the code more readable, as shown in the following.


In [192]:
z.value_counts()\
.head(3)\
.values

array([2, 2, 1], dtype=int64)
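
An alternative to trailing backslashes, which many Python style guides prefer, is to wrap the whole chain in parentheses. This is a sketch of the same chain; `to_numpy()` is used here in place of `.values` and returns the same NumPy array.

```python
import pandas as pd

z = pd.Series(["a", "b", "a", "c", "d", "b"])

# Parentheses allow line breaks without trailing backslashes.
top_counts = (
    z.value_counts()
     .head(3)
     .to_numpy()
)
print(top_counts)
```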

In [202]:
x=pd.Series([10,20,30,4.4], index=np.arange(1,5), name="tens")  # copy and fastpath are left at their defaults; fastpath is an internal parameter

In [90]:
x.name

'tens'

In [203]:
x

1    10.0
2    20.0
3    30.0
4     4.4
Name: tens, dtype: float64

If you want to learn more about the Series object and the methods used with Series objects, refer to the following link
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

## DataFrames
- A DataFrame is an extension of a Series. 
- It is a two-dimensional data structure for storing data. 
- While the Series object contains two components (a set of values, and index labels attached to these values),
- the DataFrame object contains three components: the column object, the index object, and a NumPy array object that contains the values.
- The index and columns are collectively called the axes. The index forms axis “0” and the columns form axis “1”.
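
To see what the two axes mean in practice, here is a small sketch with made-up numbers. Many DataFrame methods accept an `axis` argument; `sum` is used here only as an example.

```python
import pandas as pd

ages = pd.DataFrame({"class 1": [22, 40], "class 2": [24, 50]})

# axis=0 works down the index, giving one result per column.
per_column = ages.sum(axis=0)
# axis=1 works across the columns, giving one result per row.
per_row = ages.sum(axis=1)

print(per_column)
print(per_row)
```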

### Different Methods for Creating a DataFrame

In [None]:
# Constructor signature, for reference (not runnable code):
# pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

In [213]:
#By combining Series objects
student_ages=pd.Series([22,24,20]) #series 1
teacher_ages=pd.Series([40,50,45])#series 2
combined_ages=pd.DataFrame((student_ages,teacher_ages)) #DataFrame
combined_ages.columns=['class 1','class 2','class 3']#naming columns
combined_ages

Unnamed: 0,class 1,class 2,class 3
0,22,24,20
1,40,50,45


In [215]:
#From a dictionary CODE:
combined_ages=pd.DataFrame({'class 1':[22,40,40],'class 2':[24,50,45],'class 3':[20,45,56]})
type(combined_ages)

pandas.core.frame.DataFrame

In [217]:
#From a numpy array CODE:
numerical_df=pd.DataFrame(np.arange(1,9).reshape(2,4),columns=["a","b","c","d"])
#numerical_df.columns=["a","b","c","d"]
numerical_df

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,5,6,7,8


In [221]:
df=pd.DataFrame(data,columns=["temp","rainfall","hum"])

In [228]:
df[['temp','hum']]

Unnamed: 0,temp,hum
0,25.0,99.0
1,39.0,70.0
2,59.0,77.0
3,84.0,38.0
4,66.0,52.0
...,...,...
9995,80.0,98.0
9996,27.0,60.0
9997,99.0,58.0
9998,70.0,91.0


In [229]:
#Using a set of tuples CODE:
combined_ages=pd.DataFrame([(22,24,20),(40,50,45)],columns=['class 1','class 2','class 3'])
combined_ages

Unnamed: 0,class 1,class 2,class 3
0,22,24,20
1,40,50,45


We have re-created the “combined_ages” DataFrame using a set of tuples. Each tuple is equivalent to a row in a DataFrame.

You can create a DataFrame from a dictionary, a NumPy array, a set of tuples, or by combining Series objects. Each of these methods uses the pd.DataFrame function. Note that the characters “D” and “F” in this function name are in uppercase; pd.dataframe does not work.
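
One detail worth checking is how the input is oriented. In this sketch (reusing the student and teacher ages from above, with illustrative column names), a sequence of Series becomes rows, while a dictionary of Series becomes columns.

```python
import pandas as pd

student_ages = pd.Series([22, 24, 20])
teacher_ages = pd.Series([40, 50, 45])

# A sequence of Series (or tuples): each one becomes a row.
as_rows = pd.DataFrame([student_ages, teacher_ages])

# A dictionary of Series: each one becomes a named column.
as_columns = pd.DataFrame({"students": student_ages, "teachers": teacher_ages})

print(as_rows.shape)
print(as_columns.shape)
```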

### Creating DataFrames by importing data from other formats

- Pandas can read data from a wide variety of formats using its reader functions (refer to the complete list of supported formats here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). The following are some of the commonly used formats.

### From a CSV file:
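- The read_csv function can be used to read data from a CSV file into a DataFrame, as shown in the following. The skiprows=4 argument tells Pandas to skip the first four lines of the file before reading the header row.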

In [235]:
import pandas as pd
olympics_data=pd.read_csv("https://raw.githubusercontent.com/svkarthik86/Advance-Python-Numpy/123da03bbfbb71ba3264bf8133c86ef632435022/olympics.csv",skiprows=4)

In [236]:
olympics_data

Unnamed: 0,City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
0,Athens,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1,Athens,1896,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
2,Athens,1896,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
3,Athens,1896,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
4,Athens,1896,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100m freestyle for sailors,M,Silver
...,...,...,...,...,...,...,...,...,...,...
29211,Beijing,2008,Wrestling,Wrestling Gre-R,"ENGLICH, Mirko",GER,Men,84 - 96kg,M,Silver
29212,Beijing,2008,Wrestling,Wrestling Gre-R,"MIZGAITIS, Mindaugas",LTU,Men,96 - 120kg,M,Bronze
29213,Beijing,2008,Wrestling,Wrestling Gre-R,"PATRIKEEV, Yuri",ARM,Men,96 - 120kg,M,Bronze
29214,Beijing,2008,Wrestling,Wrestling Gre-R,"LOPEZ, Mijain",CUB,Men,96 - 120kg,M,Gold


In [237]:
type(olympics_data)

pandas.core.frame.DataFrame

In [238]:
olympics_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   City          29216 non-null  object
 1   Edition       29216 non-null  int64 
 2   Sport         29216 non-null  object
 3   Discipline    29216 non-null  object
 4   Athlete       29216 non-null  object
 5   NOC           29216 non-null  object
 6   Gender        29216 non-null  object
 7   Event         29216 non-null  object
 8   Event_gender  29216 non-null  object
 9   Medal         29216 non-null  object
dtypes: int64(1), object(9)
memory usage: 2.2+ MB


In [242]:
olympics_data['City'].values

array(['Athens', 'Athens', 'Athens', ..., 'Beijing', 'Beijing', 'Beijing'],
      dtype=object)

In [249]:
olympics_data['Gender'].value_counts()

Men      21721
Women     7495
Name: Gender, dtype: int64

In [2]:
import pandas as pd
src=r"D:\Edubridge\Associate-Data Analytics\Advancepython\Data-files-master\Data-files-master\subset-covid-data.csv"
covid19_data=pd.read_csv(src)  # read data and converted to DataFrame object
covid19_data  # DataFrame object

Unnamed: 0,country,continent,date,day,month,year,cases,deaths,country_code,population
0,Afghanistan,Asia,2020-04-12,12,4,2020,34,3,AFG,37172386.0
1,Albania,Europe,2020-04-12,12,4,2020,17,0,ALB,2866376.0
2,Algeria,Africa,2020-04-12,12,4,2020,64,19,DZA,42228429.0
3,Andorra,Europe,2020-04-12,12,4,2020,21,2,AND,77006.0
4,Angola,Africa,2020-04-12,12,4,2020,0,0,AGO,30809762.0
...,...,...,...,...,...,...,...,...,...,...
201,Venezuela,America,2020-04-12,12,4,2020,0,0,VEN,28870195.0
202,Vietnam,Asia,2020-04-12,12,4,2020,4,0,VNM,95540395.0
203,Yemen,Asia,2020-04-12,12,4,2020,0,0,YEM,28498687.0
204,Zambia,Africa,2020-04-12,12,4,2020,0,0,ZMB,17351822.0


In [111]:
type(covid19_data)

pandas.core.frame.DataFrame

In [114]:
type(covid19_data[['country_code','country']])

pandas.core.frame.DataFrame

In [116]:
covid19_data[['country_code','country']]

Unnamed: 0,country_code,country
0,AFG,Afghanistan
1,ALB,Albania
2,DZA,Algeria
3,AND,Andorra
4,AGO,Angola
...,...,...
201,VEN,Venezuela
202,VNM,Vietnam
203,YEM,Yemen
204,ZMB,Zambia


In [125]:
covid19_data.loc[:5]

Unnamed: 0,country,continent,date,day,month,year,cases,deaths,country_code,population
0,Afghanistan,Asia,2020-04-12,12,4,2020,34,3,AFG,37172386.0
1,Albania,Europe,2020-04-12,12,4,2020,17,0,ALB,2866376.0
2,Algeria,Africa,2020-04-12,12,4,2020,64,19,DZA,42228429.0
3,Andorra,Europe,2020-04-12,12,4,2020,21,2,AND,77006.0
4,Angola,Africa,2020-04-12,12,4,2020,0,0,AGO,30809762.0
5,Anguilla,America,2020-04-12,12,4,2020,0,0,,


Reading data from CSV files is one of the most common ways to create a DataFrame. CSV files are comma-separated files for storing and retrieving values, where each line is equivalent to a row. 
If you pass only a file name to “read_csv”, remember to first upload the CSV file to Jupyter using the upload button on the Jupyter home page.
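
To experiment with read_csv without a file on disk, the CSV text can be wrapped in an in-memory buffer. This is a minimal sketch; the text in `csv_text` is invented for illustration and only imitates the layout of the files used above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """generated for illustration
country,continent,cases
Albania,Europe,17
Algeria,Africa,64
"""

# skiprows=1 skips the leading note line, so the next line is used as the header.
sample = pd.read_csv(io.StringIO(csv_text), skiprows=1)
print(sample)
print(sample.dtypes)
```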

### From an Excel file:
- Pandas provides support for importing data from both xls and xlsx file formats using the pd.read_excel function.


In [257]:
import pandas as pd 
src=r"D:\Edubridge\Associate-Data Analytics\Advancepython\Data-files-master\Data-files-master\COVID-19.xlsx"
covid19_excel=pd.read_excel(src)
covid19_excel

Unnamed: 0,dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp
0,2020-06-29,29,6,2020,351,18,Afghanistan,AF,AFG,38041757.0,Asia
1,2020-06-28,28,6,2020,165,20,Afghanistan,AF,AFG,38041757.0,Asia
2,2020-06-27,27,6,2020,276,8,Afghanistan,AF,AFG,38041757.0,Asia
3,2020-06-26,26,6,2020,460,36,Afghanistan,AF,AFG,38041757.0,Asia
4,2020-06-25,25,6,2020,234,21,Afghanistan,AF,AFG,38041757.0,Asia
...,...,...,...,...,...,...,...,...,...,...,...
26557,2020-03-25,25,3,2020,0,0,Zimbabwe,ZW,ZWE,14645473.0,Africa
26558,2020-03-24,24,3,2020,0,1,Zimbabwe,ZW,ZWE,14645473.0,Africa
26559,2020-03-23,23,3,2020,0,0,Zimbabwe,ZW,ZWE,14645473.0,Africa
26560,2020-03-22,22,3,2020,1,0,Zimbabwe,ZW,ZWE,14645473.0,Africa


In [252]:
type(covid19_excel)

pandas.core.frame.DataFrame

### From a JSON file:
- JSON stands for JavaScript Object Notation and is a cross-platform file format for transmitting and exchanging data between the client and server. Pandas provides the function read_json to read data from a JSON file.
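
As a small self-contained sketch, read_json can also parse JSON text held in memory. The two records in `json_text` are made up for illustration; each JSON object becomes one row.

```python
import io
import pandas as pd

# Hypothetical JSON text: a list of records, one object per row.
json_text = '[{"name": "Albania", "iso3": "ALB"}, {"name": "Algeria", "iso3": "DZA"}]'

countries = pd.read_json(io.StringIO(json_text))
print(countries)
```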

In [259]:
json_data=pd.read_json('https://raw.githubusercontent.com/svkarthik86/Advance-Python-Numpy/main/countries.json')
json_data

Unnamed: 0,id,name,iso3,iso2,numeric_code,phone_code,capital,currency,currency_name,currency_symbol,tld,native,region,subregion,timezones,translations,latitude,longitude,emoji,emojiU
0,1,Afghanistan,AFG,AF,4,93,Kabul,AFN,Afghan afghani,؋,.af,افغانستان,Asia,Southern Asia,"[{'zoneName': 'Asia/Kabul', 'gmtOffset': 16200...","{'kr': '아프가니스탄', 'br': 'Afeganistão', 'pt': 'A...",33.000000,65.0,🇦🇫,U+1F1E6 U+1F1EB
1,2,Aland Islands,ALA,AX,248,+358-18,Mariehamn,EUR,Euro,€,.ax,Åland,Europe,Northern Europe,"[{'zoneName': 'Europe/Mariehamn', 'gmtOffset':...","{'kr': '올란드 제도', 'br': 'Ilhas de Aland', 'pt':...",60.116667,19.9,🇦🇽,U+1F1E6 U+1F1FD
2,3,Albania,ALB,AL,8,355,Tirana,ALL,Albanian lek,Lek,.al,Shqipëria,Europe,Southern Europe,"[{'zoneName': 'Europe/Tirane', 'gmtOffset': 36...","{'kr': '알바니아', 'br': 'Albânia', 'pt': 'Albânia...",41.000000,20.0,🇦🇱,U+1F1E6 U+1F1F1
3,4,Algeria,DZA,DZ,12,213,Algiers,DZD,Algerian dinar,دج,.dz,الجزائر,Africa,Northern Africa,"[{'zoneName': 'Africa/Algiers', 'gmtOffset': 3...","{'kr': '알제리', 'br': 'Argélia', 'pt': 'Argélia'...",28.000000,3.0,🇩🇿,U+1F1E9 U+1F1FF
4,5,American Samoa,ASM,AS,16,+1-684,Pago Pago,USD,US Dollar,$,.as,American Samoa,Oceania,Polynesia,"[{'zoneName': 'Pacific/Pago_Pago', 'gmtOffset'...","{'kr': '아메리칸사모아', 'br': 'Samoa Americana', 'pt...",-14.333333,-170.0,🇦🇸,U+1F1E6 U+1F1F8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,243,Wallis And Futuna Islands,WLF,WF,876,681,Mata Utu,XPF,CFP franc,₣,.wf,Wallis et Futuna,Oceania,Polynesia,"[{'zoneName': 'Pacific/Wallis', 'gmtOffset': 4...","{'kr': '왈리스 푸투나', 'br': 'Wallis e Futuna', 'pt...",-13.300000,-176.2,🇼🇫,U+1F1FC U+1F1EB
246,244,Western Sahara,ESH,EH,732,212,El-Aaiun,MAD,Moroccan Dirham,MAD,.eh,الصحراء الغربية,Africa,Northern Africa,"[{'zoneName': 'Africa/El_Aaiun', 'gmtOffset': ...","{'kr': '서사하라', 'br': 'Saara Ocidental', 'pt': ...",24.500000,-13.0,🇪🇭,U+1F1EA U+1F1ED
247,245,Yemen,YEM,YE,887,967,Sanaa,YER,Yemeni rial,﷼,.ye,اليَمَن,Asia,Western Asia,"[{'zoneName': 'Asia/Aden', 'gmtOffset': 10800,...","{'kr': '예멘', 'br': 'Iêmen', 'pt': 'Iémen', 'nl...",15.000000,48.0,🇾🇪,U+1F1FE U+1F1EA
248,246,Zambia,ZMB,ZM,894,260,Lusaka,ZMW,Zambian kwacha,ZK,.zm,Zambia,Africa,Eastern Africa,"[{'zoneName': 'Africa/Lusaka', 'gmtOffset': 72...","{'kr': '잠비아', 'br': 'Zâmbia', 'pt': 'Zâmbia', ...",-15.000000,30.0,🇿🇲,U+1F1FF U+1F1F2


In [260]:
json_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               250 non-null    int64  
 1   name             250 non-null    object 
 2   iso3             250 non-null    object 
 3   iso2             250 non-null    object 
 4   numeric_code     250 non-null    int64  
 5   phone_code       250 non-null    object 
 6   capital          250 non-null    object 
 7   currency         250 non-null    object 
 8   currency_name    250 non-null    object 
 9   currency_symbol  250 non-null    object 
 10  tld              250 non-null    object 
 11  native           249 non-null    object 
 12  region           250 non-null    object 
 13  subregion        250 non-null    object 
 14  timezones        250 non-null    object 
 15  translations     250 non-null    object 
 16  latitude         250 non-null    float64
 17  longitude       

### From an HTML file:
- We can also import data from a web page using the pd.read_html function.
- This function parses the tables on the web page into DataFrame objects. 
- It returns a list of DataFrame objects, which correspond to the tables on the web page. 
- In the following example, table[0] corresponds to the first table at the given URL.


In [27]:
# url="https://www.w3schools.com/sql/sql_create_table.asp"
# table=pd.read_html(url)
# table[0]

In [43]:
import pandas as pd
html_data=pd.read_html("test.html",flavor="html5lib")
html_data

[   Unnamed: 0  name  physics  chemistry  algebra
 0           0  Somu       68         84       78
 1           1  Kiku       74         56       88
 2           2  Amol       77         73       82
 3           3  Lini       78         69       87]

Further reading: See the complete list of supported formats in Pandas and the functions
for reading data from such formats:
https://pandas.pydata.org/pandas-docs/stable/reference/io.html

### Accessing attributes in a DataFrame
- The attributes of a DataFrame object are accessed using dot notation.
- SYNTAX
```
<DataFrameObject>.<attribute_name>
```


![image.png](attachment:71625f29-ccd1-42fe-8a5c-eec5b514c3fc.png)

In [3]:
import pandas as pd
combined_ages=pd.DataFrame({'class 1':[22,40],'class 2':[24,50],'class 3':[20,45]},index=('a','b'))
combined_ages

Unnamed: 0,class 1,class 2,class 3
a,22,24,20
b,40,50,45


In [8]:
x=pd.DataFrame(np.arange(10).reshape(-1,2))

In [10]:
x.columns

RangeIndex(start=0, stop=2, step=1)

### index

In [6]:
combined_ages.index

Index(['a', 'b'], dtype='object')

In [13]:
covid19_data.index

RangeIndex(start=0, stop=206, step=1)

### columns

In [52]:
#The columns attribute gives you information about the columns (their names and data type).

combined_ages.columns

Index(['class 1', 'class 2', 'class 3'], dtype='object')

In [53]:
covid19_data.columns

Index(['country', 'continent', 'date', 'day', 'month', 'year', 'cases',
       'deaths', 'country_code', 'population'],
      dtype='object')

- The index object and the columns object are both types of index objects.
- By default (when no row labels are supplied), the index object has the type RangeIndex, 
- while the columns object here has the type “Index”. 
- The values of the index object act as row labels, while those of the columns object act as column labels.
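
The difference between a default and a custom index can be sketched as follows. Both frames below are small illustrative examples, not the notebook's datasets.

```python
import pandas as pd

default_df = pd.DataFrame({"class 1": [22, 40], "class 2": [24, 50]})
labelled_df = pd.DataFrame(
    {"class 1": [22, 40], "class 2": [24, 50]},
    index=["a", "b"],
)

print(type(default_df.index).__name__)   # index type when no labels are given
print(type(labelled_df.index).__name__)  # index type with custom string labels
print(labelled_df.loc["b", "class 2"])   # row label first, then column label
```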

### Accessing the values in the DataFrame
- Using the values attribute, you can obtain the data stored in the DataFrame. 



In [14]:
combined_ages.values


array([[22, 24, 20],
       [40, 50, 45]], dtype=int64)

In [8]:
covid19_data.values

array([['Afghanistan', 'Asia', '2020-04-12', ..., 3, 'AFG', 37172386.0],
       ['Albania', 'Europe', '2020-04-12', ..., 0, 'ALB', 2866376.0],
       ['Algeria', 'Africa', '2020-04-12', ..., 19, 'DZA', 42228429.0],
       ...,
       ['Yemen', 'Asia', '2020-04-12', ..., 0, 'YEM', 28498687.0],
       ['Zambia', 'Africa', '2020-04-12', ..., 0, 'ZMB', 17351822.0],
       ['Zimbabwe', 'Africa', '2020-04-12', ..., 0, 'ZWE', 14439018.0]],
      dtype=object)

In [9]:
combined_ages

Unnamed: 0,class 1,class 2,class 3
a,22,24,20
b,40,50,45


In [10]:
combined_ages.value_counts()

class 1  class 2  class 3
22       24       20         1
40       50       45         1
dtype: int64

In [12]:
covid19_data['continent'].value_counts()

Europe     54
Africa     52
America    49
Asia       42
Oceania     8
Other       1
Name: continent, dtype: int64

### AXES
 
- This attribute returns a list containing both the index (row labels) and the column labels.
 
- SYNTAX
```
<DataFrameObject>.axes
```

In [60]:
combined_ages.axes

[RangeIndex(start=0, stop=2, step=1),
 Index(['class 1', 'class 2', 'class 3'], dtype='object')]

In [13]:
covid19_data.axes

[RangeIndex(start=0, stop=206, step=1),
 Index(['country', 'continent', 'date', 'day', 'month', 'year', 'cases',
        'deaths', 'country_code', 'population'],
       dtype='object')]

### DTYPES
 
- This attribute returns the data type of each column in the DataFrame.
 
- SYNTAX
```
<DataFrameObject>.dtypes
```

In [14]:
combined_ages.dtypes

class 1    int64
class 2    int64
class 3    int64
dtype: object

In [15]:
covid19_data.dtypes

country          object
continent        object
date             object
day               int64
month             int64
year              int64
cases             int64
deaths            int64
country_code     object
population      float64
dtype: object

### SIZE
 
- This attribute returns the number of elements in the DataFrame, which is the product of the number of rows and the number of columns.
- SYNTAX
```
<DataFrameObject>.size
```

In [64]:
combined_ages.size

6

In [18]:
covid19_data.size

2060

### SHAPE
- This attribute returns the shape of the DataFrame as a tuple, i.e., the number of rows and the number of columns.
- SYNTAX
```
<DataFrameObject>.shape
```
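
The relationship between shape and size can be verified with a short sketch. The frame `ages` is a small illustrative example with two rows and three columns.

```python
import pandas as pd

ages = pd.DataFrame({"class 1": [22, 40], "class 2": [24, 50], "class 3": [20, 45]})

n_rows, n_cols = ages.shape  # shape is a (rows, columns) tuple
print(ages.shape)
print(ages.size == n_rows * n_cols)
```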

In [20]:
combined_ages.shape

(2, 3)

In [31]:
covid19_data

Unnamed: 0,country,continent,date,day,month,year,cases,deaths,country_code,population
0,Afghanistan,Asia,2020-04-12,12,4,2020,34,3,AFG,37172386.0
1,Albania,Europe,2020-04-12,12,4,2020,17,0,ALB,2866376.0
2,Algeria,Africa,2020-04-12,12,4,2020,64,19,DZA,42228429.0
3,Andorra,Europe,2020-04-12,12,4,2020,21,2,AND,77006.0
4,Angola,Africa,2020-04-12,12,4,2020,0,0,AGO,30809762.0
...,...,...,...,...,...,...,...,...,...,...
201,Venezuela,America,2020-04-12,12,4,2020,0,0,VEN,28870195.0
202,Vietnam,Asia,2020-04-12,12,4,2020,4,0,VNM,95540395.0
203,Yemen,Asia,2020-04-12,12,4,2020,0,0,YEM,28498687.0
204,Zambia,Africa,2020-04-12,12,4,2020,0,0,ZMB,17351822.0


### NDIM
- This attribute returns the number of dimensions of the object: 1 for a Series and 2 for a DataFrame.
 
- SYNTAX
```
<DataFrameObject>.ndim
```

In [21]:
combined_ages.ndim

2

In [22]:
covid19_data.ndim

2

In [75]:
combined_ages

Unnamed: 0,class 1,class 2,class 3
0,22,24,20
1,40,50,45


### EMPTY
- This attribute returns a Boolean (True or False) that indicates whether the DataFrame is empty, i.e., whether it has no rows or no columns.
- SYNTAX
```
<DataFrameObject>.empty
```
 
- There is also a method, isna(), that checks for the presence of missing values (NaN, Not a Number).
 
- SYNTAX
```
<DataFrameObject>.isna()
```
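
The two checks answer different questions, as this sketch with made-up values suggests: a frame whose cells are mostly missing still has rows, so it is not empty.

```python
import numpy as np
import pandas as pd

# A frame with rows is not "empty", even if many of its values are missing.
mostly_missing = pd.DataFrame({"A": [np.nan, np.nan], "B": [np.nan, 1.0]})

print(mostly_missing.empty)         # whether the frame has no rows or no columns
print(mostly_missing.isna().sum())  # number of missing values per column
```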

In [23]:
combined_ages.empty

False

In [24]:
covid19_data_na=pd.read_csv("subset-covid-data.csv")

In [25]:
covid19_data_na.empty

False

In [47]:
df_empty = pd.DataFrame({'A' : [],'B' : []})

In [48]:
df_empty

Unnamed: 0,A,B


In [49]:
df_empty.empty

True

In [38]:
combined_ages=combined_ages.append({'class 1':35,'class 2':33,'class 3':21,'class 4':20},ignore_index=True)
combined_ages


Unnamed: 0,class 1,class 2,class 3,class 4
0,22,24,20,
1,40,50,45,
2,35,33,21,20.0
3,35,33,21,20.0
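
Note that DataFrame.append was deprecated in Pandas 1.4 and removed in Pandas 2.0. A sketch of the same row addition using pd.concat, which works in both older and newer versions, might look like this (the variable names `combined` and `new_row` are illustrative):

```python
import pandas as pd

combined = pd.DataFrame({"class 1": [22, 40], "class 2": [24, 50], "class 3": [20, 45]})

# Wrap the new row in a one-row DataFrame, then concatenate.
new_row = pd.DataFrame([{"class 1": 35, "class 2": 33, "class 3": 21, "class 4": 20}])
combined = pd.concat([combined, new_row], ignore_index=True)

print(combined)
```

As with append, columns that exist in only one of the inputs are filled with NaN for the other rows.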


In [50]:
combined_ages.isna()

Unnamed: 0,class 1,class 2,class 3,class 4
0,False,False,False,True
1,False,False,False,True
2,False,False,False,False
3,False,False,False,False


In [51]:
combined_ages=combined_ages.append({'class 1':23,'class 2':345,'class 3':51,'class 4':22,'class 5':[12,23,34,5],0:23},ignore_index=True)
combined_ages


Unnamed: 0,class 1,class 2,class 3,class 4,0,class 5
0,22,24,20,,,
1,40,50,45,,,
2,35,33,21,20.0,,
3,35,33,21,20.0,,
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]"


In [54]:
combined_ages.dropna(axis=1)

Unnamed: 0,class 1,class 2,class 3
0,22,24,20
1,40,50,45
2,35,33,21
3,35,33,21
4,23,345,51


In [100]:
covid19_data_na.dropna(axis=1)

Unnamed: 0,date,day,month,year,cases,deaths
0,2020-04-12,12,4,2020,34,3
1,2020-04-12,12,4,2020,17,0
2,2020-04-12,12,4,2020,64,19
3,2020-04-12,12,4,2020,21,2
4,2020-04-12,12,4,2020,0,0
...,...,...,...,...,...,...
201,2020-04-12,12,4,2020,0,0
202,2020-04-12,12,4,2020,4,0
203,2020-04-12,12,4,2020,0,0
204,2020-04-12,12,4,2020,0,0


In [101]:
covid19_data_na

Unnamed: 0,country,continent,date,day,month,year,cases,deaths,country_code,population
0,Afghanistan,,2020-04-12,12,4,2020,34,3,AFG,37172386.0
1,Albania,Europe,2020-04-12,12,4,2020,17,0,ALB,2866376.0
2,,Africa,2020-04-12,12,4,2020,64,19,DZA,42228429.0
3,Andorra,Europe,2020-04-12,12,4,2020,21,2,AND,77006.0
4,Angola,Africa,2020-04-12,12,4,2020,0,0,AGO,30809762.0
...,...,...,...,...,...,...,...,...,...,...
201,Venezuela,America,2020-04-12,12,4,2020,0,0,VEN,28870195.0
202,Vietnam,Asia,2020-04-12,12,4,2020,4,0,VNM,95540395.0
203,Yemen,Asia,2020-04-12,12,4,2020,0,0,YEM,28498687.0
204,Zambia,Africa,2020-04-12,12,4,2020,0,0,ZMB,17351822.0


In [90]:
combined_ages.isna()

Unnamed: 0,class 1,class 2,class 3
0,False,False,False
1,False,False,False


### COUNT
- The count method returns the number of non-missing (non-NA) values in a DataFrame. By default, it counts down the rows and returns one value per column.
 
- We can pass the axis explicitly: count(0) counts down the rows and gives one value per column (this is the default), and count(1) counts across the columns and gives one value per row.
 
- Instead of 0 and 1, we can use axis='index' or axis='columns'.
 
SYNTAX
```
<DataFrameObject>.count()
```

In [55]:
combined_ages

Unnamed: 0,class 1,class 2,class 3,class 4,0,class 5
0,22,24,20,,,
1,40,50,45,,,
2,35,33,21,20.0,,
3,35,33,21,20.0,,
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]"


In [59]:
True+True+False

2

In [57]:
combined_ages.count(1)

0    3
1    3
2    4
3    4
4    6
dtype: int64

In [60]:
combined_ages

Unnamed: 0,class 1,class 2,class 3,class 4,0,class 5
0,22,24,20,,,
1,40,50,45,,,
2,35,33,21,20.0,,
3,35,33,21,20.0,,
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]"


In [14]:
combined_ages.count(1) # non-NA values in each row (counted across the columns)

0    3
1    3
dtype: int64

In [16]:
combined_ages.count(axis='columns')

0    3
1    3
dtype: int64
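
The two axis settings are easy to mix up, so here is a minimal sketch on a small, made-up DataFrame (the `ages` frame and its values are assumptions for illustration, not part of the notebook's data). It shows that `count` skips missing cells and that the axis decides whether we get one number per column or one per row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing cells.
ages = pd.DataFrame({
    "class 1": [22, 40, 35],
    "class 2": [24, np.nan, 33],
    "class 3": [np.nan, 45, 21],
})

# Default (axis=0 / axis='index'): one non-NA count per column.
per_column = ages.count()

# axis=1 / axis='columns': one non-NA count per row.
per_row = ages.count(axis="columns")

print(per_column)
print(per_row)
```

If the total number of rows is needed regardless of missing values, `len(ages)` or `ages.shape[0]` is the usual choice.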

### T
 
- This attribute transposes the DataFrame; i.e., rows become columns and columns become rows.
 
- SYNTAX
```
<DataFrameObject>.T
```

In [61]:
combined_ages.T

Unnamed: 0,0,1,2,3,4
class 1,22.0,40.0,35.0,35.0,23
class 2,24.0,50.0,33.0,33.0,345
class 3,20.0,45.0,21.0,21.0,51
class 4,,,20.0,20.0,22.0
0,,,,,23.0
class 5,,,,,"[12, 23, 34, 5]"
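
As a quick check of what transposing does to the labels, the following sketch uses a small, hypothetical `ages` frame (not the notebook's `combined_ages`). The old column names become the row index, and the old row index becomes the column labels.

```python
import pandas as pd

# Hypothetical 2 x 3 frame.
ages = pd.DataFrame({"class 1": [22, 40], "class 2": [24, 50], "class 3": [20, 45]})

transposed = ages.T

print(ages.shape)        # (2, 3)
print(transposed.shape)  # (3, 2)
print(transposed.index.tolist())    # former column names
print(transposed.columns.tolist())  # former row labels
```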


In [115]:
combined_ages

Unnamed: 0,class 1,class 2,class 3,0,class 4,class 5
0,22,24,20,31.0,,
1,40,50,45,48.0,,
2,35,33,21,,20.0,
3,23,345,51,,22.0,23.0
4,23,345,51,,22.0,"[12, 23, 34, 5]"
5,23,345,51,23.0,22.0,"[12, 23, 34, 5]"


![image.png](attachment:0e91ce48-c979-46f7-a3f7-07edd290bc9c.png)

In [64]:
import pandas as pd
olympic_data=pd.read_csv("https://raw.githubusercontent.com/svkarthik86/Advance-Python-Numpy/main/olympics.csv",skiprows=4)

In [65]:
olympic_data

Unnamed: 0,City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
0,Athens,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1,Athens,1896,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
2,Athens,1896,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
3,Athens,1896,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
4,Athens,1896,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100m freestyle for sailors,M,Silver
...,...,...,...,...,...,...,...,...,...,...
29211,Beijing,2008,Wrestling,Wrestling Gre-R,"ENGLICH, Mirko",GER,Men,84 - 96kg,M,Silver
29212,Beijing,2008,Wrestling,Wrestling Gre-R,"MIZGAITIS, Mindaugas",LTU,Men,96 - 120kg,M,Bronze
29213,Beijing,2008,Wrestling,Wrestling Gre-R,"PATRIKEEV, Yuri",ARM,Men,96 - 120kg,M,Bronze
29214,Beijing,2008,Wrestling,Wrestling Gre-R,"LOPEZ, Mijain",CUB,Men,96 - 120kg,M,Gold


In [66]:
olympic_data["NOC"]

0        HUN
1        AUT
2        GRE
3        GRE
4        GRE
        ... 
29211    GER
29212    LTU
29213    ARM
29214    CUB
29215    RUS
Name: NOC, Length: 29216, dtype: object

In [67]:
olympic_data.NOC

0        HUN
1        AUT
2        GRE
3        GRE
4        GRE
        ... 
29211    GER
29212    LTU
29213    ARM
29214    CUB
29215    RUS
Name: NOC, Length: 29216, dtype: object

In [69]:
olympic_data[['City', 'Edition', 'Athlete','Event', 'Medal']]

Unnamed: 0,City,Edition,Athlete,Event,Medal
0,Athens,1896,"HAJOS, Alfred",100m freestyle,Gold
1,Athens,1896,"HERSCHMANN, Otto",100m freestyle,Silver
2,Athens,1896,"DRIVAS, Dimitrios",100m freestyle for sailors,Bronze
3,Athens,1896,"MALOKINIS, Ioannis",100m freestyle for sailors,Gold
4,Athens,1896,"CHASAPIS, Spiridon",100m freestyle for sailors,Silver
...,...,...,...,...,...
29211,Beijing,2008,"ENGLICH, Mirko",84 - 96kg,Silver
29212,Beijing,2008,"MIZGAITIS, Mindaugas",96 - 120kg,Bronze
29213,Beijing,2008,"PATRIKEEV, Yuri",96 - 120kg,Bronze
29214,Beijing,2008,"LOPEZ, Mijain",96 - 120kg,Gold


In [68]:
olympic_data.columns

Index(['City', 'Edition', 'Sport', 'Discipline', 'Athlete', 'NOC', 'Gender',
       'Event', 'Event_gender', 'Medal'],
      dtype='object')

## Modifying DataFrame objects
- This section shows how to change the names of columns and how to add and delete columns and rows.

### Renaming columns
- The names of the columns can be changed using the rename method.
- A dictionary is passed as an argument to this method. 
- The keys for this dictionary are the old column names, and the values are the new column names.


In [70]:
combined_ages

Unnamed: 0,class 1,class 2,class 3,class 4,0,class 5
0,22,24,20,,,
1,40,50,45,,,
2,35,33,21,20.0,,
3,35,33,21,20.0,,
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]"


In [82]:

combined_ages.rename(columns={0:'class 5','class 5':'class6'},inplace=True)

In [86]:

combined_ages.rename(columns={"class":"123"},inplace=True) # keys that match no column are silently ignored

In [83]:
combined_ages

Unnamed: 0,class 1,class 2,class 3,class 4,class 5,class6
0,22,24,20,,,
1,40,50,45,,,
2,35,33,21,20.0,,
3,35,33,21,20.0,,
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]"


- We use the inplace parameter so that the changes are made in the original DataFrame object instead of in a returned copy.


- Renaming can also be done by assigning a list of new column names directly to the columns attribute.



In [89]:
combined_ages.columns=['batch 1','batch 2','batch 3','batch 4','batch 5','batch 6']

In [90]:
combined_ages

Unnamed: 0,batch 1,batch 2,batch 3,batch 4,batch 5,batch 6
0,22,24,20,,,
1,40,50,45,,,
2,35,33,21,20.0,,
3,35,33,21,20.0,,
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]"


- Assigning to the columns attribute is a more direct way to rename columns, and the changes are made to the original DataFrame object.
- The disadvantage of this method is that we must supply a name for every column and remember the order of the columns in the DataFrame.
- With the rename method, the dictionary names only the columns we are changing, so the order does not matter.
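
The trade-off between the two renaming approaches can be sketched as follows. The `ages` frame and the new names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical frame.
ages = pd.DataFrame({"class 1": [22, 40], "class 2": [24, 50]})

# rename: only the listed keys change; a key that matches no column is ignored.
# Without inplace=True a new DataFrame is returned and `ages` is left alone.
renamed = ages.rename(columns={"class 1": "batch 1", "no such column": "x"})

# columns attribute: every name must be supplied, in the current column order.
relabelled = ages.copy()
relabelled.columns = ["group A", "group B"]

print(renamed.columns.tolist())
print(relabelled.columns.tolist())
print(ages.columns.tolist())
```

Assigning a list whose length differs from the number of columns raises an error, which is one more reason to prefer `rename` when only a few names change.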

In [91]:
olympic_data.rename(columns={'Discipline':"Games"}) # returns a copy with 'Discipline' renamed to "Games"; olympic_data itself is unchanged

Unnamed: 0,City,Edition,Sport,Games,Athlete,NOC,Gender,Event,Event_gender,Medal
0,Athens,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1,Athens,1896,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
2,Athens,1896,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
3,Athens,1896,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
4,Athens,1896,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100m freestyle for sailors,M,Silver
...,...,...,...,...,...,...,...,...,...,...
29211,Beijing,2008,Wrestling,Wrestling Gre-R,"ENGLICH, Mirko",GER,Men,84 - 96kg,M,Silver
29212,Beijing,2008,Wrestling,Wrestling Gre-R,"MIZGAITIS, Mindaugas",LTU,Men,96 - 120kg,M,Bronze
29213,Beijing,2008,Wrestling,Wrestling Gre-R,"PATRIKEEV, Yuri",ARM,Men,96 - 120kg,M,Bronze
29214,Beijing,2008,Wrestling,Wrestling Gre-R,"LOPEZ, Mijain",CUB,Men,96 - 120kg,M,Gold


### Replacing values or observations in a DataFrame
- The replace method can be used to replace values in a DataFrame. 
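- We can again use the dictionary format, with each key/value pair representing an old value and its new value. Here, we replace the value 33 with the value 300. The replace method returns a new DataFrame, so the original is unchanged unless the result is assigned back.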

In [92]:
combined_ages

Unnamed: 0,batch 1,batch 2,batch 3,batch 4,batch 5,batch 6
0,22,24,20,,,
1,40,50,45,,,
2,35,33,21,20.0,,
3,35,33,21,20.0,,
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]"


In [95]:
combined_ages.replace({33:300})

Unnamed: 0,batch 1,batch 2,batch 3,batch 4,batch 5,batch 6
0,22,24,20,,,
1,40,50,45,,,
2,35,300,21,20.0,,
3,35,300,21,20.0,,
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]"
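
A dictionary passed to `replace` may hold several old/new pairs at once. The sketch below uses a small, hypothetical frame to show that and to confirm that the original object is not modified.

```python
import pandas as pd

# Hypothetical frame.
ages = pd.DataFrame({"batch 1": [22, 35], "batch 2": [24, 33]})

# Each key is an old value, each value its replacement.
replaced = ages.replace({33: 300, 22: 220})

print(replaced)
print(ages)  # still holds 22 and 33
```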


### Adding a new column to a DataFrame
There are four ways to insert a new column in a DataFrame:


### 1.With the indexing operator, [ ]


In [100]:
combined_ages['batch 7']=[18,40,30,34,4] # overwrites the column if the name already exists; otherwise creates a new column
combined_ages

Unnamed: 0,batch 1,batch 2,batch 3,batch 4,batch 5,batch 6,batch 7
0,22,24,20,,,,18
1,40,50,45,,,,40
2,35,33,21,20.0,,,30
3,35,33,21,20.0,,,34
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]",4


In [101]:
combined_ages

Unnamed: 0,batch 1,batch 2,batch 3,batch 4,batch 5,batch 6,batch 7
0,22,24,20,,,,18
1,40,50,45,,,,40
2,35,33,21,20.0,,,30
3,35,33,21,20.0,,,34
4,23,345,51,22.0,23.0,"[12, 23, 34, 5]",4


By mentioning the column name as a string within the indexing operator and assigning it values, we can add a column.

### 2.Using the insert method


In [103]:
combined_ages.insert(2,'batch x',range(10,15))  # raises an error if the column name already exists
combined_ages

Unnamed: 0,batch 1,batch 2,batch x,batch 3,batch 4,batch 5,batch 6,batch 7
0,22,24,10,20,,,,18
1,40,50,11,45,,,,40
2,35,33,12,21,20.0,,,30
3,35,33,13,21,20.0,,,34
4,23,345,14,51,22.0,23.0,"[12, 23, 34, 5]",4


In [104]:
import numpy as np
olympic_data.insert(1,"Test",np.arange(len(olympic_data)))  # one value per row (29216 rows)

In [None]:
olympic_data.insert(1,"Test2",olympic_data.Medal)

In [108]:
olympic_data

Unnamed: 0,City,Test2,Test,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
0,Athens,Gold,0,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1,Athens,Silver,1,1896,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
2,Athens,Bronze,2,1896,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
3,Athens,Gold,3,1896,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
4,Athens,Silver,4,1896,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100m freestyle for sailors,M,Silver
...,...,...,...,...,...,...,...,...,...,...,...,...
29211,Beijing,Silver,29211,2008,Wrestling,Wrestling Gre-R,"ENGLICH, Mirko",GER,Men,84 - 96kg,M,Silver
29212,Beijing,Bronze,29212,2008,Wrestling,Wrestling Gre-R,"MIZGAITIS, Mindaugas",LTU,Men,96 - 120kg,M,Bronze
29213,Beijing,Bronze,29213,2008,Wrestling,Wrestling Gre-R,"PATRIKEEV, Yuri",ARM,Men,96 - 120kg,M,Bronze
29214,Beijing,Gold,29214,2008,Wrestling,Wrestling Gre-R,"LOPEZ, Mijain",CUB,Men,96 - 120kg,M,Gold


The insert method can be used for adding a column at a specific position. Three arguments need to be passed to this method:
- The first argument is the position (column index) at which to insert the new column.
- The second argument is the name of the new column.
- The third argument holds the values of the new column (range(10,15) in the combined_ages example).

All three arguments are mandatory. The insert method modifies the DataFrame in place and raises an error if the column name already exists.
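
The three arguments, and the error for a duplicate name, can be sketched on a small, hypothetical frame (the names and values below are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame.
ages = pd.DataFrame({"batch 1": [22, 40], "batch 2": [24, 50]})

# Position 1, name 'batch x', values [10, 11]. insert works in place and returns None.
result = ages.insert(1, "batch x", [10, 11])
print(result)                 # None
print(ages.columns.tolist())  # 'batch x' now sits between the two original columns

# Inserting the same name again raises ValueError.
duplicate_error = None
try:
    ages.insert(0, "batch x", [0, 0])
except ValueError as err:
    duplicate_error = str(err)
print(duplicate_error)
```

Because `insert` returns `None`, writing `ages = ages.insert(...)` would throw the DataFrame away; the call is meant to be used on its own line.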

In [112]:
combined_ages

Unnamed: 0,batch 1,batch 2,batch x,batch 3,batch 4,batch 5,batch 6,batch 7
0,22,24,10,20,,,,18
1,40,50,11,45,,,,40
2,35,33,12,21,20.0,,,30
3,35,33,13,21,20.0,,,34
4,23,345,14,51,22.0,23.0,"[12, 23, 34, 5]",4


### 3.Using the loc indexer 


In [114]:
import numpy as np
combined_ages.loc[:,'batch y']=np.arange(10,15) # overwrites the column if it already exists; otherwise creates a new column
combined_ages

Unnamed: 0,batch 1,batch 2,batch x,batch 3,batch 4,batch 5,batch 6,batch 7,batch y
0,10,24,10,20,,,,18,10
1,11,50,11,45,,,,40,11
2,12,33,12,21,20.0,,,30,12
3,13,33,13,21,20.0,,,34,13
4,14,345,14,51,22.0,23.0,"[12, 23, 34, 5]",4,14


batch 1                   6
batch 3                  23
batch 2                 345
batch 3                  51
batch z                  16
batch y                  16
batch 4                22.0
batch x                   6
batch 5     [12, 23, 34, 5]
batch 6                 NaN
batch 7                23.0
batch 8                  37
batch 12                  6
Name: 6, dtype: object

The loc indexer is generally used for retrieving values from Series and DataFrames, but it can also be used for inserting a column. In the preceding statement, all the rows are selected using the : operator, followed by the name of the column to be inserted. The values for this column are supplied as a NumPy array (a list also works).
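
Both behaviours of `loc` assignment, creating a column and overwriting one, are shown in this short sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame.
ages = pd.DataFrame({"batch 1": [22, 40, 35]})

# The label does not exist yet, so a new column is created.
ages.loc[:, "batch y"] = np.arange(10, 13)

# The label already exists, so its values are overwritten.
ages.loc[:, "batch 1"] = [1, 2, 3]

print(ages)
```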

In [115]:
combined_ages

Unnamed: 0,batch 1,batch 2,batch x,batch 3,batch 4,batch 5,batch 6,batch 7,batch y
0,10,24,10,20,,,,18,10
1,11,50,11,45,,,,40,11
2,12,33,12,21,20.0,,,30,12
3,13,33,13,21,20.0,,,34,13
4,14,345,14,51,22.0,23.0,"[12, 23, 34, 5]",4,14


### 4.Using the concat function


In [117]:
batchz=pd.Series(np.arange(5))
combined_ages=pd.concat([combined_ages,batchz],axis=1,ignore_index=True)
combined_ages

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,10,24,10,20,,,,18,10,0,0
1,11,50,11,45,,,,40,11,1,1
2,12,33,12,21,20.0,,,30,12,2,2
3,13,33,13,21,20.0,,,34,13,3,3
4,14,345,14,51,22.0,23.0,"[12, 23, 34, 5]",4,14,4,4


In [120]:
combined_ages.axes

[RangeIndex(start=0, stop=5, step=1), RangeIndex(start=0, stop=11, step=1)]

In [177]:
covid19_data['Test']=np.linspace(1,100,206)

In [178]:
covid19_data

Unnamed: 0,country,continent,date,day,month,year,cases,deaths,country_code,population,Test
0,Afghanistan,Asia,2020-04-12,12,4,2020,34,3,AFG,37172386.0,1.000000
1,Albania,Europe,2020-04-12,12,4,2020,17,0,ALB,2866376.0,1.482927
2,Algeria,Africa,2020-04-12,12,4,2020,64,19,DZA,42228429.0,1.965854
3,Andorra,Europe,2020-04-12,12,4,2020,21,2,AND,77006.0,2.448780
4,Angola,Africa,2020-04-12,12,4,2020,0,0,AGO,30809762.0,2.931707
...,...,...,...,...,...,...,...,...,...,...,...
201,Venezuela,America,2020-04-12,12,4,2020,0,0,VEN,28870195.0,98.068293
202,Vietnam,Asia,2020-04-12,12,4,2020,4,0,VNM,95540395.0,98.551220
203,Yemen,Asia,2020-04-12,12,4,2020,0,0,YEM,28498687.0,99.034146
204,Zambia,Africa,2020-04-12,12,4,2020,0,0,ZMB,17351822.0,99.517073


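First, the column to be added (batchz in this case) is defined as a Series object. It is then added to the DataFrame object using the pd.concat function. The axis needs to be set to 1 since the new data is being added along the column axis. Note that ignore_index=True discards the existing column names and relabels the columns 0, 1, 2, and so on, as the output of combined_ages.axes shows.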
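
If the column names should survive the concatenation, one option is to give the Series a name and leave `ignore_index` at its default. The following is a minimal sketch on a hypothetical frame; the name `batch z` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame.
ages = pd.DataFrame({"batch 1": [22, 40, 35], "batch 2": [24, 50, 33]})

# The Series name becomes the label of the new column.
batch_z = pd.Series(np.arange(3), name="batch z")

kept = pd.concat([ages, batch_z], axis=1)
renumbered = pd.concat([ages, batch_z], axis=1, ignore_index=True)

print(kept.columns.tolist())        # original names plus 'batch z'
print(renumbered.columns.tolist())  # integer labels replace the names
```

Rows are matched on their index labels, so the Series should share the DataFrame's index; labels present on only one side produce NaN.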

- In summary, we can add a column to a DataFrame using the indexing operator, loc indexer, insert method, or concat function. The most straightforward and commonly used method for adding a column is by using the indexing operator [].

## Inserting rows in a DataFrame
- There are two methods for adding rows in a DataFrame, either by using the append method or with the concat function.

### 1.Using the append method



In [33]:
combined_ages=combined_ages.append({'class 1':35,'class 2':33,'class 3':21},ignore_index=True)
combined_ages


Unnamed: 0,batch 1,batch 2,class 0,batch 3,class 4,0,class 1,class 2,class 3
0,22.0,24.0,18.0,20.0,20.0,31.0,,,
1,40.0,50.0,35.0,45.0,40.0,48.0,,,
2,35.0,33.0,46.0,21.0,30.0,,,,
3,,,,,,,35.0,33.0,21.0


The data to be added is defined as a dictionary, which is then passed as the argument to the append method. Setting ignore_index=True prevents an error from being thrown and resets the index. When using the append method, we need to either set ignore_index=True or give a name to a Series before appending it to a DataFrame.
Note that the append method does not have an inplace parameter; hence we need to set the original variable to point to the new object created by append, as shown in the preceding code. In pandas 1.4 the append method is deprecated (it emits a FutureWarning), and it was removed in pandas 2.0.

### 2.Using the pd.concat function


In [34]:
new_row=pd.DataFrame([{'class 1':32,'class 2':37,'class 3':41}])
pd.concat([combined_ages,new_row])

Unnamed: 0,batch 1,batch 2,class 0,batch 3,class 4,0,class 1,class 2,class 3
0,22.0,24.0,18.0,20.0,20.0,31.0,,,
1,40.0,50.0,35.0,45.0,40.0,48.0,,,
2,35.0,33.0,46.0,21.0,30.0,,,,
3,,,,,,,35.0,33.0,21.0
0,,,,,,,32.0,37.0,41.0


The pd.concat function is used to add new rows as shown in the preceding code. The new row to be added is defined as a DataFrame object. Then the pd.concat function is called with a list containing the two DataFrames (the original DataFrame and the new row). Because ignore_index is not set here, the new row keeps its own index label, 0, which duplicates an existing label.
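
Since `DataFrame.append` no longer exists in pandas 2.0 and later, `pd.concat` is the row-adding route that works across versions. The sketch below, on a hypothetical frame, compares the index labels with and without `ignore_index=True`.

```python
import pandas as pd

# Hypothetical frame and new row.
ages = pd.DataFrame({"class 1": [22, 40], "class 2": [24, 50]})
new_row = pd.DataFrame([{"class 1": 35, "class 2": 33}])

# Default: each piece keeps its own index labels, so label 0 appears twice.
duplicated = pd.concat([ages, new_row])

# ignore_index=True renumbers the rows 0, 1, 2, ...
renumbered = pd.concat([ages, new_row], ignore_index=True)

print(duplicated.index.tolist())
print(renumbered.index.tolist())
```

When many rows have to be added, collecting them in a list and calling `pd.concat` once is generally cheaper than concatenating inside a loop, because every call builds a new DataFrame.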

- In summary, we can use either the append method or concat function for adding rows to a DataFrame.

## Deleting columns from a DataFrame
- Three techniques can be used to delete a column from a DataFrame:

### 1.del statement


In [35]:
del combined_ages['class 3']
combined_ages

Unnamed: 0,batch 1,batch 2,class 0,batch 3,class 4,0,class 1,class 2
0,22.0,24.0,18.0,20.0,20.0,31.0,,
1,40.0,50.0,35.0,45.0,40.0,48.0,,
2,35.0,33.0,46.0,21.0,30.0,,,
3,,,,,,,35.0,33.0


The preceding statement deletes the last column (named “class 3”).
Note that the deletion occurs in place, that is, in the original DataFrame itself.

### 2.Using the pop method


In [36]:
combined_ages.pop('class 2')

0     NaN
1     NaN
2     NaN
3    33.0
Name: class 2, dtype: float64

The pop method deletes a column in place and returns the deleted column as a Series object.

### 3.Using the drop method


In [37]:
combined_ages.drop(['class 1'],axis=1,inplace=True)
combined_ages

Unnamed: 0,batch 1,batch 2,class 0,batch 3,class 4,0
0,22.0,24.0,18.0,20.0,20.0,31.0
1,40.0,50.0,35.0,45.0,40.0,48.0
2,35.0,33.0,46.0,21.0,30.0,
3,,,,,,


The column(s) to be dropped are given as strings within a list, which is then passed as an argument to the drop method. Since the drop method removes rows (axis=0) by default, we need to specify the axis value as 1 if we want to remove a column.
Unlike the del statement and pop method, the drop method returns a new DataFrame and leaves the original unchanged by default; therefore, we need to add inplace=True (or assign the result back).
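
The three ways of deleting a column differ mainly in what they return and whether they touch the original. A minimal side-by-side sketch, using a hypothetical four-column frame:

```python
import pandas as pd

# Hypothetical frame.
ages = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# del: removes the column in place and returns nothing.
del ages["a"]

# pop: removes the column in place and hands it back as a Series.
popped = ages.pop("b")

# drop: returns a new DataFrame; `ages` still contains 'c'.
# columns=[...] is equivalent to passing the list with axis=1.
without_c = ages.drop(columns=["c"])

print(ages.columns.tolist())
print(popped.tolist())
print(without_c.columns.tolist())
```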

- To sum up, we can use the del statement, pop method, or drop method to delete a column from a DataFrame.

### Deleting a row from a DataFrame
- There are two methods for removing rows from a DataFrame – either by using a Boolean selection or by using the drop method

### Using a Boolean selection


In [38]:

combined_ages[~(combined_ages.values<50)]

Unnamed: 0,batch 1,batch 2,class 0,batch 3,class 4,0
1,40.0,50.0,35.0,45.0,40.0,48.0
2,35.0,33.0,46.0,21.0,30.0,
3,,,,,,
3,,,,,,
3,,,,,,
3,,,,,,
3,,,,,,
3,,,,,,


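We use the NOT operator (~) to negate a Boolean condition and keep only the rows that do not satisfy it. Note that combined_ages.values<50 is a two-dimensional mask with one entry per cell, not one per row, so the result above repeats a row once for every cell that is not less than 50 (NaN cells included, because NaN<50 is False). To remove whole rows, the mask first needs to be reduced to one Boolean per row, for example with any(axis=1) or all(axis=1).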
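
A row-wise version of the Boolean selection can be sketched as follows. The frame and the rule ("remove every row that contains a value below 50") are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical frame.
ages = pd.DataFrame({"batch 1": [22, 60, 35], "batch 2": [24, 50, 70]})

# One Boolean per row: True if any cell in that row is below 50.
has_small_value = (ages < 50).any(axis=1)

# ~ negates the mask, so only rows without a value below 50 remain.
kept = ages[~has_small_value]

print(has_small_value.tolist())
print(kept)
```

Using `all(axis=1)` instead of `any(axis=1)` would remove a row only when every one of its values is below 50.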

### Using the drop method


In [40]:
combined_ages.drop(1) #remove the second row, which has a row index of 1

Unnamed: 0,batch 1,batch 2,class 0,batch 3,class 4,0
0,22.0,24.0,18.0,20.0,20.0,31.0
2,35.0,33.0,46.0,21.0,30.0,
3,,,,,,


Here, we remove the second row, which has a row index of 1. If there is more than one row to be removed, we need to specify the indexes of the rows in a list.

Thus, we can use either a Boolean selection or the drop method to remove rows from a DataFrame. Since the drop method works for removing both rows and columns, it can be used uniformly. Remember to add the required parameters to the drop method.
For removing columns, the axis=1 parameter needs to be added. For changes to be reflected in the original DataFrame, the inplace=True parameter needs to be included.
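
Dropping several rows at once, with and without `inplace=True`, might look like this on a hypothetical frame with string row labels:

```python
import pandas as pd

# Hypothetical frame with custom row labels.
ages = pd.DataFrame({"batch 1": [22, 40, 35, 23]}, index=["a", "b", "c", "d"])

# A list of labels removes several rows; a new DataFrame is returned.
fewer = ages.drop(["a", "c"])

# inplace=True modifies `ages` itself and returns None.
ages.drop(index="d", inplace=True)

print(fewer.index.tolist())
print(ages.index.tolist())
```

Note that `drop` works with index labels, not positions, so on a DataFrame whose index is not the default 0, 1, 2, ... the label has to be passed.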

## Indexing
- Indexing is fundamental to Pandas; label-based lookups through an index are what make retrieval and access to data fast.
- It is crucial to set an appropriate index to optimize performance.
- An index is backed by a NumPy array, is immutable (its labels cannot be modified in place), and contains hashable objects. A hashable object is one that can be converted to an integer value based on its contents (similar to the keys of a dictionary). Objects that compare equal have the same hash value.
- Pandas has two types of indexes - a row index (vertical) with labels attached to rows, and a column index with labels (column names) for every column.
- Let us now explore index objects – their data types, their properties, and how they speed up access to data.
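
Immutability can be checked directly. In this sketch (labels and data are made up for illustration), changing a single label in place fails, while replacing the whole index on a DataFrame is allowed.

```python
import pandas as pd

labels = pd.Index(["x", "y", "z"])

# Item assignment on an Index raises TypeError.
message = None
try:
    labels[0] = "w"
except TypeError as err:
    message = str(err)
print(message)

# The index as a whole can still be replaced on the owning object.
frame = pd.DataFrame({"value": [1, 2, 3]}, index=labels)
frame.index = ["p", "q", "r"]
print(frame.index.tolist())
```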

### Type of an index object
- An index object has a type, some of which are listed here.
- Index: This is a generic index type; the column index has this type.
- RangeIndex: Default index type in Pandas (used when an index is not defined separately), implemented as a range of increasing integers. This index type helps with saving memory.
- Int64Index: An index type containing integers as labels. For this index type, the index labels need not be equally spaced, whereas this is required for an index of type RangeIndex.
- Float64Index: Contains floating-point numbers (numbers with a decimal point) as index labels. (Int64Index and Float64Index exist in pandas 1.x; from pandas 2.0 onward such labels are held in a plain Index with a numeric dtype.)
- IntervalIndex: Contains intervals (for instance, the interval between two integers) as labels.
- CategoricalIndex: A limited and finite set of values.
- DatetimeIndex: Used to represent date and time, like in time-series data.
- PeriodIndex: Represents periods like quarters, months, or years.
- TimedeltaIndex: Represents duration between two periods of time or two dates.
- MultiIndex: Hierarchical index with multiple levels.

Learn more about types of indexes here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html
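
To see a few of these index types in practice, the sketch below builds some small, hypothetical objects and inspects the type of each index. It sticks to types that exist in both pandas 1.x and 2.x.

```python
import pandas as pd

# No index given: rows get a RangeIndex, columns a generic Index.
frame = pd.DataFrame({"x": [1, 2, 3]})
print(type(frame.index).__name__)
print(type(frame.columns).__name__)

# A run of consecutive dates gives a DatetimeIndex.
dates = pd.date_range("2020-04-12", periods=3, freq="D")
print(type(dates).__name__)

# Tuples of labels give a hierarchical MultiIndex with two levels.
multi = pd.MultiIndex.from_tuples([("Asia", "VNM"), ("Africa", "ZMB")])
print(multi.nlevels)
```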

## Creating a custom index and using columns as indexes
- When a Pandas object is created, a default index of the type RangeIndex is created.
- An index of this type has the first label value as 0 and the second label as 1, following an arithmetic progression with a spacing of one integer.
- We can set a customized index, using either the index parameter or the index attribute.
- In the Series and DataFrame objects created so far, no index labels were supplied, so the default index (of type RangeIndex) was used.
- We can use the index parameter when we define a Series or DataFrame to give custom values to the index labels.

In [41]:
periodic_table=pd.DataFrame({'Element':['Hydrogen','Helium','Lithium','Beryllium','Boron']},index=['H','He','Li','Be','B'])