
# **Pandas** 

## **Descriptive Statistics**


---


Many DataFrame methods compute descriptive statistics and related operations.

Most of these are aggregations like sum() and mean(), but some of them, like cumsum() and cumprod(), produce an object of the same size. 

Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or by integer:

**DataFrame − “index” (axis=0, default), “columns” (axis=1)**
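As a minimal sketch of the two ways to name an axis, the snippet below sums a small frame along each axis. The frame `scores` and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical two-column numeric frame, used only for this illustration
scores = pd.DataFrame({'math': [1, 2, 3], 'art': [10, 20, 30]})

# axis=0 and axis='index' both collapse the rows: one result per column
per_column = scores.sum(axis='index')
print(per_column)

# axis=1 and axis='columns' both collapse the columns: one result per row
per_row = scores.sum(axis='columns')
print(per_row)
```

The name and the integer are interchangeable, so `scores.sum(axis=0)` and `scores.sum(axis='index')` return the same Series.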

In [None]:
import pandas as pd
import numpy as np
 
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
 
#Create a DataFrame
df = pd.DataFrame(d)
print (df)

      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
4    Steve   30    3.20
5    Smith   29    4.60
6     Jack   23    3.80
7      Lee   34    3.78
8    David   40    2.98
9   Gasper   30    4.80
10  Betina   51    4.10
11  Andres   46    3.65


### **sum()**

Returns the sum of the values for the requested axis. By default, axis is index (axis=0).

In [None]:
print (df.sum())

Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object


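Each column is summed separately; for string columns, "summing" concatenates the strings.

### **axis=1**
Passing axis=1 sums across the columns of each row instead. The Name column holds strings, so numeric_only=True restricts the sum to the numeric columns; pandas 2.0 and later no longer drop non-numeric columns silently and raise an error otherwise.

In [None]:
print (df.sum(axis=1, numeric_only=True))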

0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64


### **mean()**
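Returns the average value of each numeric column. numeric_only=True skips the string Name column, which pandas 2.0 and later would otherwise reject with an error.

In [None]:
print (df.mean(numeric_only=True))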

Age       31.833333
Rating     3.743333
dtype: float64


### **std()**
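Returns the Bessel-corrected (sample) standard deviation of the numerical columns. numeric_only=True skips the string Name column, which pandas 2.0 and later would otherwise reject with an error.

In [None]:
print (df.std(numeric_only=True))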

Age       9.232682
Rating    0.661628
dtype: float64
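To make the Bessel correction concrete, here is a small sketch that recomputes the Age standard deviation by hand. It reuses the Age values from the DataFrame above; the variable names `ages` and `manual_std` are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Age values copied from the DataFrame above
ages = pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46])

# Sample standard deviation: divide the squared deviations by n - 1
n = len(ages)
manual_std = np.sqrt(((ages - ages.mean()) ** 2).sum() / (n - 1))

print(manual_std)        # hand-computed sample standard deviation
print(ages.std())        # pandas uses ddof=1 by default
print(ages.std(ddof=0))  # population standard deviation (divide by n)
```

Passing `ddof=0` gives the population standard deviation, which is what `np.std` computes by default.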


### **Functions & Description**
Let us now understand the functions under Descriptive Statistics in Python Pandas. 

The following table lists the important functions −

Sr.No.	| Function |	Description
:---|:----|:----
1 |	count() |	Number of non-null observations
2 |	sum() |	Sum of values
3 |	mean() |	Mean of Values
4 |	median() |	Median of Values
5 |	mode() |	Mode of values
6 |	std() |	Standard Deviation of the Values
7 |	min() |	Minimum Value
8 |	max() |	Maximum Value
9 |	abs() |	Absolute Value
10 |	prod() |	Product of Values
11 |	cumsum() |	Cumulative Sum
12 |	cumprod() |	Cumulative Product

**Note** − Because a DataFrame is a heterogeneous data structure, not every operation works on every column.

Functions like **sum(), cumsum()** work with both numeric and string data elements without raising an error. Although string aggregations are rarely useful in practice, these functions do not throw an exception.

Functions like **abs(), cumprod()** throw an exception when the DataFrame contains string data, because such operations cannot be performed on strings.
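One way to avoid these exceptions is to select the numeric columns first. The sketch below assumes a hypothetical mixed-type frame named `mixed` and uses `select_dtypes` to keep only its numeric columns before calling the numeric-only methods.

```python
import pandas as pd

# Hypothetical mixed-type frame for illustration
mixed = pd.DataFrame({'name': ['a', 'b', 'c'],
                      'delta': [-1.5, 2.0, -3.0]})

# Keep only the numeric columns before calling numeric-only methods
numeric = mixed.select_dtypes(include='number')

print(numeric.abs())      # same shape as the input
print(numeric.cumprod())  # running product, same shape as the input
print(numeric.sum())      # aggregation: one value per column
```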

### **Summarizing Data**
The describe() function computes a summary of statistics pertaining to the DataFrame columns.

In [None]:
print (df.describe())

             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000


This function reports the **count**, **mean**, **std**, **min**, **max** and the quartiles (25%, 50%, 75%) of each column; the **IQR** is the difference between the 75% and 25% values. 

By default, the function excludes the string columns and summarizes only the numeric columns.

 **'include'** is the argument that controls which columns are considered for summarizing. 
 
It takes a list of dtypes; when it is omitted, only numeric columns are summarized.

* **object** − Summarizes String columns
* **number** − Summarizes Numeric columns
* **all** − Summarizes all columns together (pass it as the string 'all', not as a list value)

Now, use the following statement in the program and check the output −

In [None]:
print (df.describe(include=['object']))

          Name
count       12
unique      12
top     Gasper
freq         1


In [None]:
print (df.describe(include='all'))

          Name        Age     Rating
count       12  12.000000  12.000000
unique      12        NaN        NaN
top     Gasper        NaN        NaN
freq         1        NaN        NaN
mean       NaN  31.833333   3.743333
std        NaN   9.232682   0.661628
min        NaN  23.000000   2.560000
25%        NaN  25.000000   3.230000
50%        NaN  29.500000   3.790000
75%        NaN  35.500000   4.132500
max        NaN  51.000000   4.800000
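describe() prints the quartiles but not the interquartile range itself. As a short sketch, the IQR of the Age column can be derived from the describe() result; the names `ages`, `summary` and `iqr` are introduced only for this example.

```python
import pandas as pd

# Age values copied from the DataFrame above
ages = pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46])

summary = ages.describe()

# Interquartile range: distance between the 75% and 25% quantiles
iqr = summary['75%'] - summary['25%']
print(iqr)  # 35.5 - 25.0 = 10.5
```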



## **Reindexing**
---
Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Reindexing can accomplish several operations, such as −

* Reorder the existing data to match a new set of labels.

* Insert missing value (NA) markers in label locations where no data for the label existed.
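Both operations can be seen in a small deterministic sketch. The frame `prices` and its labels are hypothetical and chosen only to keep the output easy to read.

```python
import pandas as pd

# Hypothetical frame with string row labels
prices = pd.DataFrame({'open': [1.0, 2.0, 3.0], 'close': [1.5, 2.5, 3.5]},
                      index=['mon', 'tue', 'wed'])

# 'wed' and 'mon' are reordered; 'fri' has no data, so its row is all NaN
reordered = prices.reindex(index=['wed', 'mon', 'fri'])
print(reordered)
```

Labels that are left out of the new index ('tue' here) are dropped from the result.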


In [None]:
import pandas as pd
import numpy as np
 
N=20
 
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})
print(df)
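# Reindex the DataFrame: keep rows 0, 2 and 5 and columns A, C and B.
# 'B' does not exist in df, so it appears as a column of NaN values.
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])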
 
print (df_reindexed)

            A     x         y       C           D
0  2016-01-01   0.0  0.596279     Low   96.966444
1  2016-01-02   1.0  0.465360  Medium  102.807429
2  2016-01-03   2.0  0.130458  Medium   99.136195
3  2016-01-04   3.0  0.578851    High  103.402639
4  2016-01-05   4.0  0.488387     Low  106.829620
5  2016-01-06   5.0  0.415212     Low  107.577507
6  2016-01-07   6.0  0.691921    High   84.999077
7  2016-01-08   7.0  0.782964  Medium  107.944863
8  2016-01-09   8.0  0.308572    High   95.152707
9  2016-01-10   9.0  0.608406  Medium  116.167901
10 2016-01-11  10.0  0.918118  Medium   84.386860
11 2016-01-12  11.0  0.164710     Low  108.831594
12 2016-01-13  12.0  0.305058  Medium   98.042604
13 2016-01-14  13.0  0.925457     Low   89.580616
14 2016-01-15  14.0  0.772258  Medium   90.220278
15 2016-01-16  15.0  0.379422     Low   79.944290
16 2016-01-17  16.0  0.144606  Medium   92.793142
17 2016-01-18  17.0  0.838892  Medium   98.258030
18 2016-01-19  18.0  0.199778    High   99.817446


**Reindex to Align with Other Objects**

You may wish to take an object and reindex its axes to be labeled the same as another object. Consider the following example to understand the same.

In [None]:
import pandas as pd
import numpy as np
 
df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
 
df1 = df1.reindex_like(df2)
print (df1)

       col1      col2      col3
0  1.948740 -0.110155  1.076390
1 -2.067672 -0.385034  0.041916
2  1.312509 -1.340803 -0.206062
3  1.660343  0.476717 -0.790884
4  1.576077 -1.430255 -0.412776
5 -0.528468  1.206081  0.368698
6 -0.193200 -1.152565  0.821028


**Note −** Here, reindex_like() returns a new DataFrame labeled like df2, which is assigned back to df1. The column names should match; otherwise, NaN is filled in for every column label that df1 lacks.

**Filling while ReIndexing**

**reindex()** takes an optional parameter method which is a filling method with values as follows −

* pad/ffill − Fill values forward

* bfill/backfill − Fill values backward

* nearest − Fill from the nearest index values
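The three fill methods can be compared side by side on a small Series. This is a sketch with a hypothetical Series `s` whose index has gaps; the fill methods require a monotonically increasing or decreasing index.

```python
import pandas as pd

# Hypothetical Series with labels 0, 3 and 6 only
s = pd.Series([10.0, 20.0, 30.0], index=[0, 3, 6])
new_index = range(7)

forward = s.reindex(new_index, method='ffill')     # carry the last value forward
backward = s.reindex(new_index, method='bfill')    # pull the next value backward
nearest = s.reindex(new_index, method='nearest')   # take the closest label's value

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'nearest': nearest}))
```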

In [None]:
import pandas as pd
import numpy as np
 
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
 
# Padding NAN's
print (df2.reindex_like(df1))
 
# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill:")
print (df2.reindex_like(df1,method='ffill'))

       col1      col2      col3
0  1.072788  0.474214  1.060543
1 -2.199235  2.279088  1.162261
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill:
       col1      col2      col3
0  1.072788  0.474214  1.060543
1 -2.199235  2.279088  1.162261
2 -2.199235  2.279088  1.162261
3 -2.199235  2.279088  1.162261
4 -2.199235  2.279088  1.162261
5 -2.199235  2.279088  1.162261


**Note** − The last four rows are padded.

**Limits on Filling while Reindexing**

The limit argument provides additional control over filling while reindexing. It specifies the maximum number of consecutive missing labels to fill. 

In [None]:
import pandas as pd
import numpy as np
 
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
 
# Padding NAN's
print (df2.reindex_like(df1))
 
# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill limiting to 1:")
print (df2.reindex_like(df1,method='ffill',limit=1))

       col1      col2      col3
0  0.161809  0.021698 -0.068214
1  0.537260  0.917941 -0.256427
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill limiting to 1:
       col1      col2      col3
0  0.161809  0.021698 -0.068214
1  0.537260  0.917941 -0.256427
2  0.537260  0.917941 -0.256427
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN


**Note** − Observe that only the third row (index 2) is filled from the preceding row (index 1). The remaining rows are left as NaN.

**Renaming**

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.



In [None]:
import pandas as pd
import numpy as np
 
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print (df1)
 
print ("After renaming the rows and columns:")
print (df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}))

       col1      col2      col3
0  0.338082 -0.030888  0.563512
1  0.491571  0.119794  0.207512
2  0.154062  1.169741 -1.994683
3 -0.755057  0.088438 -0.086876
4  1.396772  1.395794  0.643098
5 -0.411405 -0.900120  0.494318
After renaming the rows and columns:
              c1        c2      col3
apple   0.338082 -0.030888  0.563512
banana  0.491571  0.119794  0.207512
durian  0.154062  1.169741 -1.994683
3      -0.755057  0.088438 -0.086876
4       1.396772  1.395794  0.643098
5      -0.411405 -0.900120  0.494318


The rename() method provides an **inplace** named parameter, which by default is False and copies the underlying data. Pass inplace=True to rename the data in place.
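The example above uses dict mappings. As a sketch of the function form, the snippet below passes a callable for each axis; the frame `df_fn` and the `row` prefix are hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration
df_fn = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# A function is applied to every label on the given axis
renamed = df_fn.rename(columns=str.upper, index=lambda i: f'row{i}')
print(renamed)

# The original is untouched because inplace defaults to False
print(df_fn.columns.tolist())
```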

## **Iteration**


---

The behavior of basic iteration over Pandas objects depends on the type. 

When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. 

Other data structures, like DataFrame, follow the **dict-like** convention of iterating over the **keys** of the objects. The same applied to Panel, which was removed in pandas 0.25.

In short, basic iteration (for i in object) produces −

* Series − values

* DataFrame − column labels

* Panel (removed in pandas 0.25) − item labels

**Iterating a DataFrame**

Iterating a DataFrame gives column names. 

In [None]:
import pandas as pd
import numpy as np
 
N=20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
   })
 
 
for col in df:
   print (col) # Prints column names

A
x
y
C
D


To iterate over the columns or rows of the DataFrame, we can use the following functions −

* **items()** − to iterate over the (key,value) pairs; it replaces iteritems(), which was removed in pandas 2.0

* **iterrows()** − iterate over the rows as (index,series) pairs

* **itertuples()** − iterate over the rows as namedtuples

**items()**

Iterates over each column as key, value pair with label as key and column value as a Series object.

In [None]:
import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
for key,value in df.items():
   print (key,value)

col1 0    0.397422
1    0.049216
2    0.181037
3    0.265547
Name: col1, dtype: float64
col2 0   -0.316693
1    1.572868
2   -0.011679
3   -2.529440
Name: col2, dtype: float64
col3 0    0.524836
1    0.084524
2   -0.372685
3    0.562975
Name: col3, dtype: float64


Observe that each column is yielded separately as a (label, Series) pair.

**iterrows()**

iterrows() returns the iterator yielding each index value along with a series containing the data in each row.

In [None]:
for row_index,row in df.iterrows():
   print (row_index,row)

0 col1    0.397422
col2   -0.316693
col3    0.524836
Name: 0, dtype: float64
1 col1    0.049216
col2    1.572868
col3    0.084524
Name: 1, dtype: float64
2 col1    0.181037
col2   -0.011679
col3   -0.372685
Name: 2, dtype: float64
3 col1    0.265547
col2   -2.529440
col3    0.562975
Name: 3, dtype: float64


**Note** − Because iterrows() returns each row as a Series, it does not preserve the data types across the row. 0,1,2 are the row indices and col1,col2,col3 are column indices.

**itertuples()**

itertuples() method will return an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

In [None]:
for row in df.itertuples():
    print (row)

Pandas(Index=0, col1=0.397422392343423, col2=-0.31669253240292883, col3=0.524835612295341)
Pandas(Index=1, col1=0.04921617089242713, col2=1.572867677720913, col3=0.08452419079234313)
Pandas(Index=2, col1=0.18103726070372794, col2=-0.011678591112698278, col3=-0.37268531796869164)
Pandas(Index=3, col1=0.2655474137138689, col2=-2.5294402775768208, col3=0.562975484526443)
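The fields of each namedtuple can be read by attribute. The sketch below uses a hypothetical frame `df_t`, drops the index with `index=False`, and names the tuple type `Row` instead of the default `Pandas`.

```python
import pandas as pd

# Hypothetical frame for illustration
df_t = pd.DataFrame({'col1': [1.0, 2.0], 'col2': [3.0, 4.0]})

totals = []
for row in df_t.itertuples(index=False, name='Row'):
    # Fields are available as attributes named after the columns
    totals.append(row.col1 + row.col2)

print(totals)
```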


**Note** − Do not try to modify any object while iterating. Iterating is meant for reading; depending on the data types, the iterator returns a copy of the original object rather than a view, so changes made to it may not be reflected in the original object.

In [None]:
for index, row in df.iterrows():
   row['a'] = 10
print (df)

       col1      col2      col3
0  0.397422 -0.316693  0.524836
1  0.049216  1.572868  0.084524
2  0.181037 -0.011679 -0.372685
3  0.265547 -2.529440  0.562975
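When the goal is to change the DataFrame, assigning on the DataFrame itself is the dependable route. The following is a sketch with a hypothetical frame `df_m`; it adds a column in one step and updates selected rows through a boolean mask instead of a loop.

```python
import pandas as pd

# Hypothetical frame for illustration
df_m = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [4.0, 5.0, 6.0]})

# Vectorized assignment creates or overwrites a whole column at once
df_m['a'] = 10

# A boolean mask with .loc updates only the rows that match a condition
df_m.loc[df_m['col1'] > 1.5, 'col2'] = 0.0

print(df_m)
```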


## **Sorting**


---

There are two kinds of sorting available in Pandas. They are −

* By label
* By Actual Value

Let us consider an example with an output.

In [None]:
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print (unsorted_df)

       col2      col1
1  1.639513 -0.933718
4  0.130597  0.194048
6  0.057882  1.242845
2 -0.310099 -1.051568
3 -1.097428 -0.080387
5  1.154381  0.598918
9  0.719102  1.857120
8  0.469061  1.389437
0  1.452905  0.483890
7  1.948133  1.216126


In unsorted_df, the labels and the values are unsorted. Let us see how these can be sorted.

**By Label**

Using the sort_index() method, by passing the axis arguments and the order of sorting, DataFrame can be sorted. By default, sorting is done on row labels in ascending order.

In [None]:
sorted_df=unsorted_df.sort_index()
print (sorted_df)

       col2      col1
0  1.452905  0.483890
1  1.639513 -0.933718
2 -0.310099 -1.051568
3 -1.097428 -0.080387
4  0.130597  0.194048
5  1.154381  0.598918
6  0.057882  1.242845
7  1.948133  1.216126
8  0.469061  1.389437
9  0.719102  1.857120


**Order of Sorting**

By passing the Boolean value to ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.

In [None]:
sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)

       col2      col1
9  0.719102  1.857120
8  0.469061  1.389437
7  1.948133  1.216126
6  0.057882  1.242845
5  1.154381  0.598918
4  0.130597  0.194048
3 -1.097428 -0.080387
2 -0.310099 -1.051568
1  1.639513 -0.933718
0  1.452905  0.483890


**Sort the Columns**

Passing axis=1 sorts by the column labels instead. By default, axis=0, which sorts by the row labels. 

In [None]:
sorted_df=unsorted_df.sort_index(axis=1)
 
print (sorted_df)

       col1      col2
1 -0.933718  1.639513
4  0.194048  0.130597
6  1.242845  0.057882
2 -1.051568 -0.310099
3 -0.080387 -1.097428
5  0.598918  1.154381
9  1.857120  0.719102
8  1.389437  0.469061
0  0.483890  1.452905
7  1.216126  1.948133


**By Value**

Like index sorting, **sort_values()** is the method for sorting by values. 

It accepts a '**by**' argument naming the column of the DataFrame whose values determine the order.

In [None]:
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')
 
print (sorted_df)

   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1


Observe that the **col1** values are sorted, and each row's **col2** value and row index move along with it. This is why col2 and the index look unsorted.

The '**by**' argument also takes a list of column names.

In [None]:
sorted_df = unsorted_df.sort_values(by=['col1','col2'])
 
print (sorted_df)

   col1  col2
2     1     2
1     1     3
3     1     4
0     2     1
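When sorting by several columns, the direction can also be set per column by passing a list to ascending. A minimal sketch, reusing the same small frame; the name `mixed_order` is introduced only here.

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# col1 ascending, then col2 descending within each group of equal col1 values
mixed_order = unsorted_df.sort_values(by=['col1', 'col2'],
                                      ascending=[True, False])
print(mixed_order)
```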


**Sorting Algorithm**

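sort_values() lets you choose the sorting algorithm through the kind argument: quicksort (the default), mergesort, heapsort or stable. 

Mergesort and stable are the only stable algorithms. For a DataFrame, kind is applied only when sorting on a single column.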

In [None]:
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1' ,kind='mergesort')
 
print (sorted_df)

   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1
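Stability means that rows with equal keys keep their existing relative order. One way to see this, sketched below with the same small frame, is to sort by col2 first and then stably by col1: the ties in col1 stay ordered by col2, which matches a single sort on both columns. The names `by_col2`, `two_pass` and `one_pass` are introduced only for this example.

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# First pass: order by the secondary key
by_col2 = unsorted_df.sort_values(by='col2')

# Second pass: a stable sort on the primary key keeps the col2 order among ties
two_pass = by_col2.sort_values(by='col1', kind='mergesort')

# Reference: sort on both keys at once
one_pass = unsorted_df.sort_values(by=['col1', 'col2'])

print(two_pass)
```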


## **Working with Text Data**

---

Pandas provides a set of string functions which make it easy to operate on string data. 

Most importantly, these functions ignore (or exclude) missing/NaN values.

Almost all of these methods mirror Python's built-in string methods (refer: https://docs.python.org/3/library/stdtypes.html#string-methods). 

They are reached through the .str accessor of a Series or Index, which applies the operation to each element.

Sr.No. | Function | Description
:------|:------|:------
1 | lower() | Converts strings in the Series/Index to lower case.
2 | upper() | Converts strings in the Series/Index to upper case.
3 | len() | Computes the length of each string.
4 | strip() | Strips whitespace (including newlines) from both sides of each string in the Series/Index.
5 | split(' ') | Splits each string on the given pattern.
6 | cat(sep=' ') | Concatenates the Series/Index elements with the given separator.
7 | get_dummies() | Returns a DataFrame with one-hot encoded values.
8 | contains(pattern) | Returns True for each element that contains the pattern, else False.
9 | replace(a,b) | Replaces the value a with the value b.
10 | repeat(value) | Repeats each element the specified number of times.
11 | count(pattern) | Returns the number of times the pattern appears in each element.
12 | startswith(pattern) | Returns True if the element in the Series/Index starts with the pattern.
13 | endswith(pattern) | Returns True if the element in the Series/Index ends with the pattern.
14 | find(pattern) | Returns the position of the first occurrence of the pattern in each element, or -1 if it is absent.
15 | findall(pattern) | Returns a list of all occurrences of the pattern.
16 | swapcase() | Swaps the case lower/upper.
17 | islower() | Checks whether all characters in each string in the Series/Index are in lower case. Returns Boolean.
18 | isupper() | Checks whether all characters in each string in the Series/Index are in upper case. Returns Boolean.
19 | isnumeric() | Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.

Let us now create a Series and see how some of the above functions work.

In [None]:
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
 
print (s)

0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveSmith
dtype: object


**lower()**

In [None]:
print (s.str.lower())

0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object


**upper()**

In [None]:
print (s.str.upper())

0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object


**len()**

In [None]:
print (s.str.len())

0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64
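A few more functions from the table can be tried on the same data. This is a sketch rather than a full tour; it rebuilds the Series so it runs on its own, and the result names (`has_i`, `cleaned`, `parts`, `i_count`) are introduced only here.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])

# contains(): na=False reports the missing value as False instead of NaN
has_i = s.str.contains('i', na=False)
print(has_i)

# replace(): regex=False treats '@' as a literal character
cleaned = s.str.replace('@', '', regex=False)
print(cleaned)

# split(): each element becomes a list of parts
parts = s.str.split(' ')
print(parts)

# count(): number of occurrences of the pattern in each element
i_count = s.str.count('i')
print(i_count)
```

The missing value stays NaN in every result except `has_i`, where `na=False` replaces it.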


# **Exercises** 

1. **Write a Python program to add, subtract, multiply and divide two Pandas Series.** 

Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]

In [None]:
import pandas as pd
A=pd.Series([2, 4, 6, 8, 10])
B=pd.Series([1,3,5,7,9])
Add=A+B
Sub=A-B
Mult=A*B
Div=A/B
print("addition of A and B is \n",Add)
print("subtraction of A and B is\n",Sub)
print("multiple of A and B is \n",Mult)
print("division of A and B is\n ",Div)

addition of A and B is 
 0     3
1     7
2    11
3    15
4    19
dtype: int64
subtraction of A and B is
 0    1
1    1
2    1
3    1
4    1
dtype: int64
multiple of A and B is 
 0     2
1    12
2    30
3    56
4    90
dtype: int64
division of A and B is
  0    2.000000
1    1.333333
2    1.200000
3    1.142857
4    1.111111
dtype: float64


2. **Write a Pandas program to sort a given Series.** 

*Sample Output:*

Original Data Series: 
0 100

1 200

2 python

3 300.12

4 400

Output Data Series:

0 100

1 200

3 300.12

4 400

2 python


In [None]:
# Every element is a string, so the sort is lexicographic rather than numeric
S=pd.Series(['100','200','python','300.12','400'])
S=pd.DataFrame(S,columns=['values'])
print(S.sort_values('values',kind='mergesort'))

   values
0     100
1     200
3  300.12
4     400
2  python


3. **Write a Pandas program to compute the minimum, 25th percentile, median, 75th, and maximum of a given series.**

In [None]:
 
import numpy as np
import pandas as pd
Rad=pd.Series(np.arange(20,50))
print(Rad)
print("the minimum value is",min(Rad))
 
Pad=np.percentile(Rad, q=[ 25,50, 75])
 
print("The 25th,median and 75 th percentile are")
print(Pad)
print("the maximum value is",max(Rad))

0     20
1     21
2     22
3     23
4     24
5     25
6     26
7     27
8     28
9     29
10    30
11    31
12    32
13    33
14    34
15    35
16    36
17    37
18    38
19    39
20    40
21    41
22    42
23    43
24    44
25    45
26    46
27    47
28    48
29    49
dtype: int64
the minimum value is 20
The 25th,median and 75 th percentile are
[27.25 34.5  41.75]
the maximum value is 49
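The same figures can also be obtained with pandas methods alone. A short sketch, assuming the same data as above; `rad` and `quartiles` are names introduced for this example.

```python
import numpy as np
import pandas as pd

rad = pd.Series(np.arange(20, 50))

# quantile() takes fractions between 0 and 1 instead of percentages
quartiles = rad.quantile([0.25, 0.5, 0.75])

print(rad.min(), quartiles.tolist(), rad.max())
```

Both `Series.quantile` and `np.percentile` use linear interpolation by default, so the values agree.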


4.  **Write a Pandas program to find the positions of numbers that are multiples of 5 of a given series.**

In [None]:
import pandas as pd
import numpy as np
num_series = pd.Series(np.random.randint(1, 30, 9))
print("Original Series:")
print(num_series)
result=[]
for i in range(len(num_series)):
   if num_series[i] % 5 == 0:
        result.append(i)
#result =np.argwhere(num_series % 5 == 0)
print("Positions of numbers that are multiples of 5:")
print(result)

Original Series:
0     6
1    26
2    14
3    11
4    25
5    19
6    11
7    16
8    28
dtype: int64
Positions of numbers that are multiples of 5:
[4]
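A vectorized alternative to the loop is to build a boolean mask and ask NumPy for the positions of its True entries. This sketch hard-codes the values from the output above so the result is reproducible; `is_multiple` and `positions` are names introduced here.

```python
import numpy as np
import pandas as pd

# Values copied from the sample output above
num_series = pd.Series([6, 26, 14, 11, 25, 19, 11, 16, 28])

# Boolean mask: True where the value is a multiple of 5
is_multiple = (num_series % 5 == 0).to_numpy()

# flatnonzero returns the integer positions of the True entries
positions = np.flatnonzero(is_multiple)
print(positions.tolist())
```

Because the mask is converted to a NumPy array first, the result is positional even when the Series does not use the default integer index.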


5. **Write a Pandas program to display a summary of the basic information about a specified DataFrame and its data.**

*Sample Python dictionary data and list labels:*

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [None]:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'], 'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']} 
df=pd.DataFrame(exam_data,index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
print(df)
print("\n only object datatypes  \n \n",df.describe(include=['object']))
print("\n all datatypes \n \n",df.describe(include='all'))
print("\n without object datatypes \n \n",df.describe())

        name  score  attempts qualify
a  Anastasia   12.5         1     yes
b       Dima    9.0         3      no
c  Katherine   16.5         2     yes
d      James    NaN         3      no
e      Emily    9.0         2      no
f    Michael   20.0         3     yes
g    Matthew   14.5         1     yes
h      Laura    NaN         1      no
i      Kevin    8.0         2      no
j      Jonas   19.0         1     yes

 only object datatypes  
 
            name qualify
count        10      10
unique       10       2
top     Michael      no
freq          1       5

 all datatypes 
 
            name      score   attempts qualify
count        10   8.000000  10.000000      10
unique       10        NaN        NaN       2
top     Michael        NaN        NaN      no
freq          1        NaN        NaN       5
mean        NaN  13.562500   1.900000     NaN
std         NaN   4.693746   0.875595     NaN
min         NaN   8.000000   1.000000     NaN
25%         NaN   9.000000   1.000000     NaN
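Besides describe(), the info() method summarizes the structure of a DataFrame: index type, column names, non-null counts, dtypes and memory usage. The sketch below rebuilds the exercise data under the name `df_exam` and captures the report in a string buffer; writing to a buffer is an illustrative choice, since info() prints to standard output by default.

```python
import io
import numpy as np
import pandas as pd

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                      'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df_exam = pd.DataFrame(exam_data, index=labels)

# Capture the report in a buffer so it can be inspected as text
buffer = io.StringIO()
df_exam.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

The score column reports 8 non-null values because two of its ten entries are NaN.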

6. **Write a Pandas program to convert all the string values to upper, lower cases in a given pandas series. Also find the length of the string values.**

In [None]:
 
s = pd.Series(['Tom', 'johnwick', 'John', 'Alber@t', '1234','Steve'])
 
print (s)
print(s.str.upper())
print(s.str.lower())
print(s.str.len())

0         Tom
1    johnwick
2        John
3     Alber@t
4        1234
5       Steve
dtype: object
0         TOM
1    JOHNWICK
2        JOHN
3     ALBER@T
4        1234
5       STEVE
dtype: object
0         tom
1    johnwick
2        john
3     alber@t
4        1234
5       steve
dtype: object
0    3
1    8
2    4
3    7
4    4
5    5
dtype: int64


7. **Write a Pandas program to check whether only numeric values are present in a given column of a DataFrame.**

In [None]:
 
import pandas as pd
df = pd.DataFrame({
    'company_code': ['Company','Company a001', '2055', 'abcd', '123345'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
    
print("Original DataFrame:")
print(df)
print("\nNumeric values present in company_code column:")
df['company_code_is_digit'] = df['company_code'].str.isnumeric()
print(df)

Original DataFrame:
   company_code date_of_sale   sale_amount
0       Company    12/05/2002      12348.5
1  Company a001    16/02/1999     233331.2
2          2055    25/09/1998         22.5
3          abcd    12/02/2022    2566552.0
4        123345    15/09/1997         23.0

Numeric values present in company_code column:
   company_code date_of_sale   sale_amount  company_code_is_digit
0       Company    12/05/2002      12348.5                  False
1  Company a001    16/02/1999     233331.2                  False
2          2055    25/09/1998         22.5                   True
3          abcd    12/02/2022    2566552.0                  False
4        123345    15/09/1997         23.0                   True


In [None]:
a={'name':['ravi','123','gov'],'age':['15','thirty two','25'],'sex':["M","F","M"]}
df=pd.DataFrame(a)
print(df)
df['age_digit'] = df['age'].str.isdigit()
print(df)

   name         age sex
0  ravi          15   M
1   123  thirty two   F
2   gov          25   M
   name         age sex  age_digit
0  ravi          15   M       True
1   123  thirty two   F      False
2   gov          25   M       True
