# 1. Working with missing data:
In pandas, values that are missing or were not collected properly<br>
are represented in one of two ways:<br>
<b>* None:</b> A Python object used to represent missing values in object-type arrays.<br>
<b>* NaN:</b> A special floating-point value from NumPy which is recognized by all systems that use IEEE floating-point standards.<br>
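
As a small illustration of the two markers, here is a sketch with made-up values (not data from this notebook). It assumes only that pandas and NumPy are installed, and shows that both None and NaN are reported as missing:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen only for illustration
text_values = pd.Series(["a", None, "c"])      # None inside a column of strings
float_values = pd.Series([1.0, None, np.nan])  # None is stored as NaN in a float column

print(text_values.isna().tolist())   # [False, True, False]
print(float_values.isna().tolist())  # [False, True, True]
print(float_values.dtype)            # float64
```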
<h2>Some functions for checking missing values in pandas:</h2>
<h1>1. Using isnull():</h1>
isnull() returns a DataFrame of Boolean values in which True marks missing data (NaN or None).

In [1]:
# identify missing values with isnull()
# create the DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({"First_score":[100,88,np.nan,50],
                  "Second":[np.nan,44,np.nan,49]})
print(df)
#check isnull()
print(df.isnull())

   First_score  Second
0        100.0     NaN
1         88.0    44.0
2          NaN     NaN
3         50.0    49.0
   First_score  Second
0        False    True
1        False   False
2         True    True
3        False   False
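
A common follow-up is to count the missing values instead of viewing the whole Boolean table. The sketch below rebuilds the same small DataFrame and sums the Boolean mask; the variable names `missing_per_column` and `total_missing` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First_score": [100, 88, np.nan, 50],
                   "Second": [np.nan, 44, np.nan, 49]})

# True counts as 1 when summed, so this gives the number of NaN per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole DataFrame
total_missing = int(missing_per_column.sum())
print(total_missing)  # 3
```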


In [2]:
# Example 2
# Filtering rows with missing values in a CSV file
import pandas as pd
file = pd.read_csv(r"C:\Users\ASHRAF\OneDrive\Desktop\all_of_pandas\employees.csv")
# Some Gender values are missing in this file; select the rows where they are.
boolean_value = pd.isnull(file["Gender"])
missing_gender=file[boolean_value]
print(missing_gender)

    First Name Gender  Start Date Last Login Time  Salary  Bonus %  \
20        Lois    NaN   4/22/1995         7:18 PM   64714    4.934   
22      Joshua    NaN    3/8/2012         1:58 AM   90816   18.816   
27       Scott    NaN   7/11/1991         6:58 PM  122367    5.218   
31       Joyce    NaN   2/20/2005         2:40 PM   88657   12.752   
41   Christine    NaN   6/28/2015         1:08 AM   66582   11.308   
..         ...    ...         ...             ...     ...      ...   
961    Antonio    NaN   6/18/1989         9:37 PM  103050    3.050   
972     Victor    NaN   7/28/2006         2:49 PM   76381   11.159   
985    Stephen    NaN   7/10/1983         8:10 PM   85668    1.909   
989     Justin    NaN   2/10/1991         4:58 PM   38344    3.794   
995      Henry    NaN  11/23/2014         6:09 AM  132483   16.655   

    Senior Management                  Team  
20               True                 Legal  
22               True       Client Services  
27              False

<h1>2. Checking for Non-Missing Values Using notnull():</h1>
notnull() function returns a DataFrame with Boolean values where True indicates non-missing (valid) data.<br>
This function is useful when we want to focus only on the rows that have valid, non-missing values.

In [3]:
# Example 1:
# identify non-missing values
import pandas as pd
import numpy as np  # needed for np.nan below
data = pd.DataFrame({"First_score":[100,88,np.nan,50],
                  "Second":[np.nan,44,np.nan,49]})
#identify 
print(data.notnull())

   First_score  Second
0         True   False
1         True    True
2        False   False
3         True    True


In [4]:
# example2:
#filtering non_missing values
import pandas as pd
url = r"C:\Users\ASHRAF\OneDrive\Desktop\all_of_pandas\employees.csv"
data_csv = pd.read_csv(url)
# show only non_missing values of gender
n_m_g = pd.notnull(data_csv["Gender"])
n_m_g_d = data_csv[n_m_g]
print(n_m_g_d)

    First Name  Gender Start Date Last Login Time  Salary  Bonus %  \
0      Douglas    Male   8/6/1993        12:42 PM   97308    6.945   
1       Thomas    Male  3/31/1996         6:53 AM   61933    4.170   
2        Maria  Female  4/23/1993        11:17 AM  130590   11.858   
3        Jerry    Male   3/4/2005         1:00 PM  138705    9.340   
4        Larry    Male  1/24/1998         4:47 PM  101004    1.389   
..         ...     ...        ...             ...     ...      ...   
994     George    Male  6/21/2013         5:47 PM   98874    4.479   
996    Phillip    Male  1/31/1984         6:30 AM   42392   19.675   
997    Russell    Male  5/20/2013        12:39 PM   96914    1.421   
998      Larry    Male  4/20/2013         4:45 PM   60500   11.985   
999     Albert    Male  5/15/2012         6:24 PM  129949   10.169   

    Senior Management                  Team  
0                True             Marketing  
1                True                   NaN  
2               False
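
The CSV examples above depend on a local file. The following self-contained sketch shows the same filtering on a small CSV held in memory. The rows are hypothetical stand-ins and are not taken from employees.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for employees.csv
csv_text = """First Name,Gender,Salary
Douglas,Male,97308
Lois,,64714
Maria,Female,130590
Joshua,,90816
"""
employees = pd.read_csv(io.StringIO(csv_text))

# Empty CSV fields are read as NaN, so isnull()/notnull() can split the rows
missing_gender = employees[employees["Gender"].isnull()]
known_gender = employees[employees["Gender"].notnull()]

print(missing_gender["First Name"].tolist())  # ['Lois', 'Joshua']
print(known_gender["First Name"].tolist())    # ['Douglas', 'Maria']
```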

# 2. Filling Missing Values in Pandas:
The following functions replace missing values with a specified value or estimate them by interpolation.<br>
<h1>1. Using fillna()</h1>
<b>fillna()</b> replaces missing values (NaN) with a given value.

In [5]:
#Example 1: Fill Missing Values with Zero:
import pandas as pd
import numpy as np
df = pd.DataFrame({"First_score":[100,88,np.nan,50],
                  "Second":[np.nan,44,np.nan,49]})
print(df.fillna(0))

   First_score  Second
0        100.0     0.0
1         88.0    44.0
2          0.0     0.0
3         50.0    49.0
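
Zero is not always a sensible replacement. One alternative, sketched here on the same DataFrame, is to fill each column with its own mean. Whether the mean is appropriate depends on the data, so treat this as one option rather than a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First_score": [100, 88, np.nan, 50],
                   "Second": [np.nan, 44, np.nan, 49]})

# df.mean() skips NaN and returns one mean per column;
# fillna() then matches those means to the columns by name
filled = df.fillna(df.mean())
print(filled)
```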


In [6]:
#Example 2: Fill with Previous Value (Forward Fill)
# ffill() fills each missing value with the last valid value before it.
df.ffill()

   First_score  Second
0        100.0     NaN
1         88.0    44.0
2         88.0    44.0
3         50.0    49.0


In [7]:
#Example 3: Fill with Next Value (Backward Fill)
# bfill() fills each missing value with the next valid value after it.
df.bfill()

   First_score  Second
0        100.0    44.0
1         88.0    44.0
2         50.0    49.0
3         50.0    49.0
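
Both ffill() and bfill() accept a `limit` argument that caps how many consecutive missing values are filled. The Series below is a made-up example used only to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap of two missing values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

full_forward = s.ffill()            # fills the whole gap with 1.0
one_forward = s.ffill(limit=1)      # fills only the first NaN of the gap
one_backward = s.bfill(limit=1)     # fills only the NaN next to 4.0

print(full_forward.tolist())
print(one_forward.tolist())
print(one_backward.tolist())
```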


In [8]:
#Example 4: Fill NaN Values with "No_gender"
import pandas as pd
import numpy as np
d = pd.read_csv(url)
# Fill all the null values in the Gender column with "No_gender".
d.fillna({"Gender":"No_gender"},inplace=True)
print(d[10:25])

   First Name     Gender  Start Date Last Login Time  Salary  Bonus %  \
10     Louise     Female   8/12/1980         9:01 AM   63241   15.132   
11      Julie     Female  10/26/1997         3:19 PM  102508   12.637   
12    Brandon       Male   12/1/1980         1:08 AM  112807   17.492   
13       Gary       Male   1/27/2008        11:40 PM  109831    5.831   
14   Kimberly     Female   1/14/1999         7:13 AM   41426   14.543   
15    Lillian     Female    6/5/2016         6:09 AM   59414    1.256   
16     Jeremy       Male   9/21/2010         5:56 AM   90370    7.369   
17      Shawn       Male   12/7/1986         7:45 PM  111737    6.414   
18      Diana     Female  10/23/1981        10:27 AM  132940   19.082   
19      Donna     Female   7/22/2010         3:48 AM   81014    1.894   
20       Lois  No_gender   4/22/1995         7:18 PM   64714    4.934   
21    Matthew       Male    9/5/1995         2:12 AM  100612   13.645   
22     Joshua  No_gender    3/8/2012         1:58 A

<h1>2. Using replace():</h1>
The <b>replace()</b> function can replace NaN values with a specific value.

In [9]:
import numpy as np
import pandas as pd
df = pd.read_csv(url)
print(df[10:25])
# replace all NaN values with 0.0 (returns a new DataFrame; df itself is unchanged)
df.replace(to_replace=np.nan,value=0.0)

   First Name  Gender  Start Date Last Login Time  Salary  Bonus %  \
10     Louise  Female   8/12/1980         9:01 AM   63241   15.132   
11      Julie  Female  10/26/1997         3:19 PM  102508   12.637   
12    Brandon    Male   12/1/1980         1:08 AM  112807   17.492   
13       Gary    Male   1/27/2008        11:40 PM  109831    5.831   
14   Kimberly  Female   1/14/1999         7:13 AM   41426   14.543   
15    Lillian  Female    6/5/2016         6:09 AM   59414    1.256   
16     Jeremy    Male   9/21/2010         5:56 AM   90370    7.369   
17      Shawn    Male   12/7/1986         7:45 PM  111737    6.414   
18      Diana  Female  10/23/1981        10:27 AM  132940   19.082   
19      Donna  Female   7/22/2010         3:48 AM   81014    1.894   
20       Lois     NaN   4/22/1995         7:18 PM   64714    4.934   
21    Matthew    Male    9/5/1995         2:12 AM  100612   13.645   
22     Joshua     NaN    3/8/2012         1:58 AM   90816   18.816   
23        NaN    Mal

     First Name  Gender  Start Date  Last Login Time  Salary  Bonus %  Senior Management                  Team
0       Douglas    Male    8/6/1993         12:42 PM   97308    6.945               True             Marketing
1        Thomas    Male   3/31/1996          6:53 AM   61933    4.170               True                   0.0
2         Maria  Female   4/23/1993         11:17 AM  130590   11.858              False               Finance
3         Jerry    Male    3/4/2005          1:00 PM  138705    9.340               True               Finance
4         Larry    Male   1/24/1998          4:47 PM  101004    1.389               True       Client Services
...         ...     ...         ...              ...     ...      ...                ...                   ...
995       Henry     0.0  11/23/2014          6:09 AM  132483   16.655              False          Distribution
996     Phillip    Male   1/31/1984          6:30 AM   42392   19.675              False               Finance
997     Russell    Male   5/20/2013         12:39 PM   96914    1.421              False               Product
998       Larry    Male   4/20/2013          4:45 PM   60500   11.985              False  Business Development
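
replace() also works in the other direction: placeholder strings that stand for "unknown" can be turned into NaN so that isnull(), fillna() and dropna() recognize them. The placeholders "N/A" and "?" below are assumptions made for this sketch; real files may use other markers.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which "N/A" and "?" mean "unknown"
gender = pd.Series(["Male", "N/A", "?", "Female"])

# Replace every listed placeholder with NaN in one call
cleaned = gender.replace(["N/A", "?"], np.nan)

print(cleaned.isnull().tolist())  # [False, True, True, False]
```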


<h1>3. Using interpolate():</h1>
The <b>interpolate()</b> function fills missing values using interpolation techniques such as the linear method.

In [10]:
#Example:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({"A":[100,39,50,None],
                    "B":[None,43,None,66],
                    "C":[32,55,None,None]})
# Interpolate the missing values using the linear method.
# This method ignores the index and treats the values as equally spaced.
df_1.interpolate(method="linear",limit_direction="forward")

       A     B     C
0  100.0   NaN  32.0
1   39.0  43.0  55.0
2   50.0  54.5  55.0
3   50.0  66.0  55.0
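
To see where a value such as 54.5 comes from, the sketch below interpolates two small made-up Series. With the linear method, each missing value lies on the straight line between its known neighbours, so the midpoint of 43 and 66 is (43 + 66) / 2 = 54.5.

```python
import numpy as np
import pandas as pd

# Same neighbours as column B above: one gap between 43 and 66
b = pd.Series([43.0, np.nan, 66.0])
b_filled = b.interpolate(method="linear")
print(b_filled.tolist())  # [43.0, 54.5, 66.0]

# A gap of two values is split into equal steps
s = pd.Series([10.0, np.nan, np.nan, 40.0])
s_filled = s.interpolate(method="linear")
print(s_filled.tolist())  # [10.0, 20.0, 30.0, 40.0]
```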


# 3. Dropping Missing Values in Pandas
The <b>dropna()</b> function removes rows or columns that contain NaN values.<br>
It can drop data based on different conditions.<br>
<h2>1. Dropping Rows with At Least One Null Value</h2>
Remove rows that contain at least one missing value.

In [11]:
import pandas as pd
import numpy as np
data_2 = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, 40, 80, 98],
        'Fourth Score': [np.nan, np.nan, np.nan, 65]})
print(data_2)
# use dropna()
print(data_2.dropna())

   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0           52           NaN
1         90.0           NaN           40           NaN
2          NaN          45.0           80           NaN
3         95.0          56.0           98          65.0
   First Score  Second Score  Third Score  Fourth Score
3         95.0          56.0           98          65.0
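
dropna() has two further parameters that make the condition less strict: `subset` restricts the check to certain columns, and `thresh` keeps rows that have at least that many non-missing values. A sketch on the same data:

```python
import numpy as np
import pandas as pd

data_2 = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, np.nan, 45, 56],
                       'Third Score': [52, 40, 80, 98],
                       'Fourth Score': [np.nan, np.nan, np.nan, 65]})

# Drop a row only if 'First Score' is missing
by_subset = data_2.dropna(subset=['First Score'])
print(by_subset.index.tolist())  # [0, 1, 3]

# Keep rows that have at least 3 non-missing values
by_thresh = data_2.dropna(thresh=3)
print(by_thresh.index.tolist())  # [0, 3]
```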


<h2>2. Dropping Rows with All Null Values:</h2>
We can drop rows where all values are missing using <b>dropna(how='all')</b>

In [12]:
drop_rows_NaN = pd.DataFrame({'First Score': [100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, np.nan, 80, 98],
        'Fourth Score': [np.nan, np.nan, np.nan, np.nan]})
print(drop_rows_NaN)
# use dropna(how='all')
print(drop_rows_NaN.dropna(how="all"))

   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0           NaN
1          NaN           NaN          NaN           NaN
2          NaN          45.0         80.0           NaN
3         95.0          56.0         98.0           NaN
   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0           NaN
2          NaN          45.0         80.0           NaN
3         95.0          56.0         98.0           NaN


<h2>3. Dropping Columns with At Least One Null Value:</h2>
To remove columns that contain at least one missing value we use <b>dropna(axis=1)</b>.

In [13]:
# 'scores' is used as the name because 'dict' would shadow the Python built-in
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [60, 67, 68, 65]}
dataframe = pd.DataFrame(scores)
print(dataframe)
# remove columns that have at least one NaN
print(dataframe.dropna(axis=1))

   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0            60
1          NaN           NaN          NaN            67
2          NaN          45.0         80.0            68
3         95.0          56.0         98.0            65
   Fourth Score
0            60
1            67
2            68
3            65
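
Dropping every column that has any NaN can discard a lot of data. Combining `axis=1` with `how='all'` removes only the columns that are completely empty. The DataFrame below is rebuilt for this sketch so that it has one fully empty column.

```python
import numpy as np
import pandas as pd

scores_df = pd.DataFrame({'First Score': [100, np.nan, np.nan, 95],
                          'Second Score': [30, np.nan, 45, 56],
                          'Third Score': [52, np.nan, 80, 98],
                          'Fourth Score': [np.nan, np.nan, np.nan, np.nan]})

# Only 'Fourth Score' is entirely NaN, so only that column is removed
without_empty = scores_df.dropna(axis=1, how='all')
print(without_empty.columns.tolist())
```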


<h2>4. Dropping Rows with Missing Values in CSV Files</h2>
When working with CSV files, we can drop rows with missing values using dropna().

In [14]:
import pandas as pd
da= pd.read_csv(url)
# drop rows that have at least one missing value
nd = da.dropna(axis=0, how='any')
print(nd)

print("Old data frame length:", len(da))
print("New data frame length:", len(nd))
print("Rows with at least one missing value:", (len(da) - len(nd)))

    First Name  Gender Start Date Last Login Time  Salary  Bonus %  \
0      Douglas    Male   8/6/1993        12:42 PM   97308    6.945   
2        Maria  Female  4/23/1993        11:17 AM  130590   11.858   
3        Jerry    Male   3/4/2005         1:00 PM  138705    9.340   
4        Larry    Male  1/24/1998         4:47 PM  101004    1.389   
5       Dennis    Male  4/18/1987         1:35 AM  115163   10.125   
..         ...     ...        ...             ...     ...      ...   
994     George    Male  6/21/2013         5:47 PM   98874    4.479   
996    Phillip    Male  1/31/1984         6:30 AM   42392   19.675   
997    Russell    Male  5/20/2013        12:39 PM   96914    1.421   
998      Larry    Male  4/20/2013         4:45 PM   60500   11.985   
999     Albert    Male  5/15/2012         6:24 PM  129949   10.169   

    Senior Management                  Team  
0                True             Marketing  
2               False               Finance  
3                True

# 4. Removing Duplicates:
The pandas <b>DataFrame.drop_duplicates()</b> method removes duplicate rows.<br>
Syntax:<br>
<b>DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)</b><br>
Parameters:<br>
1. subset: Specifies the columns to check for duplicates. If not provided, all columns are considered.<br>
2. keep: Determines which duplicate to keep:<br>
 'first' (default): Keeps the first occurrence and removes subsequent duplicates.<br>
 'last': Keeps the last occurrence and removes previous duplicates.<br>
 False: Removes all occurrences of duplicates.<br>
3. inplace: If True, modifies the original DataFrame directly. If False (default), returns a new DataFrame.<br>
Return type: A new DataFrame with duplicates removed, or None if inplace=True.

In [15]:
# Example 1: basic use of drop_duplicates()
import pandas as pd
data = pd.DataFrame({"Name":["Alice","Bob","Jhon","Bob"],
                    "Age":[25,21,30,21],
                    "City":["NY","LA","SK","LA"]})
print("original data:\n",data)
clean = data.drop_duplicates(subset=None,inplace=False,keep="first")
print("No duplicate data:\n")
print(clean)

original data:
     Name  Age City
0  Alice   25   NY
1    Bob   21   LA
2   Jhon   30   SK
3    Bob   21   LA
No duplicate data:

    Name  Age City
0  Alice   25   NY
1    Bob   21   LA
2   Jhon   30   SK
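
Before dropping anything it can help to look at which rows pandas considers duplicates. The related `duplicated()` method returns a Boolean mask and takes the same `subset` and `keep` arguments. A sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame({"Name": ["Alice", "Bob", "Jhon", "Bob"],
                     "Age": [25, 21, 30, 21],
                     "City": ["NY", "LA", "SK", "LA"]})

# With the default keep='first', only later repeats are flagged
later_repeats = data.duplicated()
print(later_repeats.tolist())  # [False, False, False, True]

# keep=False flags every member of a duplicate group
all_members = data.duplicated(keep=False)
print(all_members.tolist())    # [False, True, False, True]

# Number of rows that drop_duplicates() would remove
n_removed = int(later_repeats.sum())
print(n_removed)               # 1
```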


<b>Common uses of DataFrame.drop_duplicates():</b><br>
<h2>1. Dropping Duplicates Based on Specific Columns:</h2>

In [16]:
# Example 2: remove rows with a duplicate Name
df_clean = data.drop_duplicates(subset=["Name"])
# Duplicates are identified by the Name column only; Age and
# City are ignored when deciding which rows are duplicates.
print(df_clean)

    Name  Age City
0  Alice   25   NY
1    Bob   21   LA
2   Jhon   30   SK


<h2>2. Keeping the Last Occurrence of Duplicates</h2>
By default drop_duplicates() retains the first occurrence of duplicates.<br>
If we want to keep the last occurrence we can use keep='last'.<br>

In [17]:
#example3: keep the last value
keep_last = data.drop_duplicates(keep="last")
#it will remove the index 1 row and keep index 3
print(keep_last)

    Name  Age City
0  Alice   25   NY
2   Jhon   30   SK
3    Bob   21   LA


<h2>3. Dropping All Duplicates:</h2>
To remove every row that has a duplicate,<br>
i.e. to retain only the rows that occur exactly once, set keep=False.

In [18]:
#example4: drop all duplicates
no_keep = data.drop_duplicates(keep=False)
# removes every row that has a duplicate (here both Bob rows)
print(no_keep)

    Name  Age City
0  Alice   25   NY
2   Jhon   30   SK


<h2>4. Modifying the Original DataFrame Directly</h2>
* To modify the DataFrame in place without creating a new variable, set inplace=True.<br>
* With inplace=True the original DataFrame is modified directly, so the result does not need to be assigned to a new variable. This does not necessarily save memory, because pandas may still build an intermediate copy internally.

In [19]:
#example5: modify the original value 
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 40],
    'City': ['NY', 'LA', 'NY', 'Chicago']
})
df.drop_duplicates(inplace=True)
# df is modified
print(df)

    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago


<h2>5. Dropping Duplicates Based on Partially Identical Columns</h2>
Sometimes we might encounter situations where duplicates are not exact rows but have identical values in certain columns.

In [20]:
import pandas as pd

data2 = {
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 55, 40, 39],
    "City": ["NY", "LA", "NY", "Chicago", "LA"]
}

df2 = pd.DataFrame(data2)
print("original data:\n",df2)
df_cleaned2 = df2.drop_duplicates(subset=["Name", "City"])

print("after dropping\n",df_cleaned2)
#Here duplicates are removed based on the Name and City columns leaving only unique combinations of Name and City.

original data:
     Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   55       NY
3  David   40  Chicago
4    Bob   39       LA
after dropping
     Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
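
With `subset`, the surviving row is simply the first one in the current order. If a particular row should win, for example the one with the highest Age, one option is to sort first and then drop duplicates. This is a sketch of that idea on the same data, not the only way to do it.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 55, 40, 39],
    "City": ["NY", "LA", "NY", "Chicago", "LA"]
})

# Sort so the highest Age comes first, keep the first row per (Name, City),
# then restore the original row order
highest_age = (df2.sort_values("Age", ascending=False)
                  .drop_duplicates(subset=["Name", "City"])
                  .sort_index())
print(highest_age)
```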


# Pandas Change Datatype:
The most common way to change the data type of a column in a Pandas DataFrame is by using the <b>astype()</b> method.<br>
This method allows you to convert a specific column to a desired data type.

In [21]:
# using astype() method:
import pandas as pd
data3 = pd.DataFrame({"Name":["Jhon","Alice","Bob","Charlie"],
                     "Age":[10,20,30,40],
                     "Gender":["M","M","M","F"],
                     "Salary":[1000,2000,3000,4000]})
# convert the 'Age' column to the float data type
data3["Age"]=data3["Age"].astype(float)
print(data3)
print(data3.dtypes)

      Name   Age Gender  Salary
0     Jhon  10.0      M    1000
1    Alice  20.0      M    2000
2      Bob  30.0      M    3000
3  Charlie  40.0      F    4000
Name       object
Age       float64
Gender     object
Salary      int64
dtype: object
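
astype() is not limited to numeric types. A column with few distinct values, such as Gender, can be converted to the pandas `category` type. The sketch below uses a standalone Series with the same values as the Gender column.

```python
import pandas as pd

gender = pd.Series(["M", "M", "M", "F"])

# 'category' stores each distinct value once and refers to it by code
as_category = gender.astype("category")

print(as_category.dtype)                  # category
print(list(as_category.cat.categories))   # ['F', 'M']
```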


<b>Converting a Column to a DateTime Type:</b><br>
Sometimes, a column that contains date information may be stored as a string.<br>
You can convert it to the datetime type using the <b>pd.to_datetime()</b> function.

In [22]:
# Example: create a "Join Date" column in data3 that holds strings
data3["Join Date"] = ['2021-01-01', '2020-05-22', '2022-03-15', '2021-07-30']
# convert it to datetime with pd.to_datetime()
data3["Join Date"] = pd.to_datetime(data3["Join Date"])
print(data3)
print("types of them:\n",data3.dtypes)

      Name   Age Gender  Salary  Join Date
0     Jhon  10.0      M    1000 2021-01-01
1    Alice  20.0      M    2000 2020-05-22
2      Bob  30.0      M    3000 2022-03-15
3  Charlie  40.0      F    4000 2021-07-30
types of them:
 Name                 object
Age                 float64
Gender               object
Salary                int64
Join Date    datetime64[ns]
dtype: object
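
Real date columns often contain entries that cannot be parsed. In the sketch below, `errors="coerce"` turns such entries into NaT (the datetime counterpart of NaN) instead of raising an error, and an explicit `format` documents the expected layout. The input strings are made up for illustration.

```python
import pandas as pd

# Hypothetical raw values; the last one is not a valid date
raw = pd.Series(["2021-01-01", "2020-05-22", "not a date"])

parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().tolist())  # [False, False, True]

# The .dt accessor exposes date parts once the column is datetime
years = parsed.dropna().dt.year.tolist()
print(years)                   # [2021, 2020]
```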


<b>Changing the data type of multiple columns</b><br>
If you need to change the data types of multiple columns at once, you can pass a dictionary to the astype() method,<br>
where keys are column names and values are the desired data types.

In [23]:
# Example: change Age to int and Salary to str
data3 = data3.astype({"Age":"int","Salary":"str"})
print(data3.dtypes)

Name                 object
Age                   int32
Gender               object
Salary               object
Join Date    datetime64[ns]
dtype: object


# Drop Empty Columns in Pandas:
<h1>1. Understanding dropna():</h1>
Syntax:<br>
<b>DataFrameName.dropna(axis=0, how='any', inplace=False)</b><br>
Parameters:<br>
<b>axis:</b> An int or string value that selects rows or columns.<br>
  Use 0 or 'index' for rows and 1 or 'columns' for columns.<br>
<b>how:</b> A string value, either 'any' or 'all'.<br> 
  'any' drops the row/column if ANY value is null and 'all' drops it only if ALL values are null.<br>
<b>inplace:</b> A Boolean; if True, the changes are made in the data frame itself.<br>
Examples are given in the cells above.<br>
<h1>Replace Both Zeros and Empty Strings with Null and Drop Null Columns:</h1>
If a column contains empty strings we need to replace them with NaN before dropping the column.<br>
Empty strings are not automatically recognized as missing values in<br>
Pandas so converting them to NaN ensures they can be handled correctly.<br>

In [24]:
#example:To clean a dataset fully we may need to replace both zeros and empty strings.
import numpy as np
import pandas as pd

df = pd.DataFrame({'FirstName': ['Vipul', 'Ashish', 'Milan'],
                   "Gender": ["", "", ""],
                   "Age": [0, 0, 0]})

df['Department'] = np.nan

nan_value = float("NaN")

# Convert specific columns before replacement
df["Gender"] = df["Gender"].astype(object)
df["Age"] = df["Age"].astype(float)

df.replace(0, nan_value,inplace=True)
df.replace("", nan_value,inplace=True )

df.dropna(how='all', axis=1, inplace=True)

print(df)


  FirstName
0     Vipul
1    Ashish
2     Milan
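
The same steps can be wrapped in a small function. `drop_empty_columns` is a hypothetical helper written for this sketch, not a pandas function. Unlike the cell above it leaves zeros alone, on the assumption that 0 can be a legitimate value, and the Age values are made up.

```python
import numpy as np
import pandas as pd

def drop_empty_columns(frame):
    """Return a copy without the columns that contain only "" or NaN."""
    # Turn empty strings into NaN, then drop columns where every value is NaN
    return frame.replace("", np.nan).dropna(axis=1, how="all")

people = pd.DataFrame({"FirstName": ["Vipul", "Ashish", "Milan"],
                       "Gender": ["", "", ""],
                       "Age": [25, 31, 28],
                       "Department": [np.nan, np.nan, np.nan]})

cleaned = drop_empty_columns(people)
print(cleaned.columns.tolist())  # ['FirstName', 'Age']
```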




# String manipulations in Pandas DataFrame
String manipulation is the process of changing, parsing, splicing, pasting or analyzing strings.<br>
<b>Create a String Dataframe using Pandas</b>

In [25]:
#simple str dataframe:
import pandas as pd
import numpy as np

data = {'Names': ['Gulshan', 'Shashank', 'Bablu', 'Abhishek', 'Anand', np.nan, 'Pratap'],
        'City': ['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Chennai', 'Bangalore', 'Hyderabad']}

df = pd.DataFrame(data)
print(df)

      Names       City
0   Gulshan      Delhi
1  Shashank     Mumbai
2     Bablu    Kolkata
3  Abhishek      Delhi
4     Anand    Chennai
5       NaN  Bangalore
6    Pratap  Hyderabad


<h1>String Manipulations in Pandas</h1>
Various methods for string manipulation:<br>
<b> 1. str.lower()<br>
    2. str.upper()<br>
    3. str.strip()<br>
    4. str.split()<br>
    5. str.len()<br>
    6. str.cat(sep='')<br>
    7. str.get_dummies()<br>
    8. str.startswith(pattern)<br>
    9. str.endswith(pattern)<br>
    10. str.replace(a,b)<br>
    11. str.repeat(value)<br>
    12. str.count(pattern)<br>
    13. str.find(pattern)<br>
    14. str.findall(pattern)<br>
    15. str.islower()<br>
    16. str.isupper()<br>
    17. str.isnumeric()<br>
    18. str.swapcase()<br>
</b><br>
<b>1. lower():</b> Converts all uppercase characters in each string of the Series to lowercase and returns the result.<br>

In [26]:
#example lower(): get all names in lowercase
print(df["Names"].str.lower())

0     gulshan
1    shashank
2       bablu
3    abhishek
4       anand
5         NaN
6      pratap
Name: Names, dtype: object


<b>2.upper():</b> Converts all lowercase characters in each string of the Series to uppercase and returns the result.

In [27]:
#example upper(): get all names in uppercase
print(df["Names"].str.upper())

0     GULSHAN
1    SHASHANK
2       BABLU
3    ABHISHEK
4       ANAND
5         NaN
6      PRATAP
Name: Names, dtype: object


<b>3.strip():</b> Removes whitespace (spaces, tabs, newlines) from the beginning and end of each string. The names in this DataFrame have no surrounding whitespace, so the output below is unchanged.

In [28]:
# Example: remove any unnecessary spaces around the names.
print(df["Names"].str.strip())

0     Gulshan
1    Shashank
2       Bablu
3    Abhishek
4       Anand
5         NaN
6      Pratap
Name: Names, dtype: object
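
Because the names above contain no extra whitespace, the effect of strip() is not visible there. The sketch below uses made-up values with leading and trailing whitespace to show the difference.

```python
import pandas as pd

# Hypothetical names with stray spaces and a tab
names = pd.Series(["  Gulshan ", "Bablu\t", " Anand"])

stripped = names.str.strip()

print(names.str.len().tolist())     # [10, 6, 6]
print(stripped.tolist())            # ['Gulshan', 'Bablu', 'Anand']
print(stripped.str.len().tolist())  # [7, 5, 5]
```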


<b>4.split(pattern):</b> Splits each string at every occurrence of the given pattern and stores the resulting pieces in a list. The example below splits on the letter 'a'.

In [29]:
df['Split_Names'] = df['Names'].str.split('a')
print(df[['Names', 'Split_Names']])

      Names   Split_Names
0   Gulshan    [Gulsh, n]
1  Shashank  [Sh, sh, nk]
2     Bablu      [B, blu]
3  Abhishek    [Abhishek]
4     Anand      [An, nd]
5       NaN           NaN
6    Pratap    [Pr, t, p]
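
A frequent use of split() is separating a full name into parts. With `expand=True` the pieces are returned as separate columns instead of lists. The surnames below are invented for this sketch.

```python
import pandas as pd

# Hypothetical full names
full_names = pd.Series(["Gulshan Kumar", "Anand Rao"])

# expand=True returns a DataFrame with one column per piece
parts = full_names.str.split(" ", expand=True)
parts.columns = ["First", "Last"]
print(parts)
```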


<b>5.len():</b> Computes the length of each string in the Series; for missing data it returns NaN.

In [30]:
#example:
print(df["Names"].str.len())

0    7.0
1    8.0
2    5.0
3    8.0
4    5.0
5    NaN
6    6.0
Name: Names, dtype: float64


<b>6.cat(sep=' '):</b> Concatenates all the strings in the Series into a single string, joined by the given separator. Missing values are skipped by default.

In [31]:
print(df)
print("\nafter using cat:\n")
print(df["Names"].str.cat(sep=", "))

      Names       City   Split_Names
0   Gulshan      Delhi    [Gulsh, n]
1  Shashank     Mumbai  [Sh, sh, nk]
2     Bablu    Kolkata      [B, blu]
3  Abhishek      Delhi    [Abhishek]
4     Anand    Chennai      [An, nd]
5       NaN  Bangalore           NaN
6    Pratap  Hyderabad    [Pr, t, p]

after using cat:

Gulshan, Shashank, Bablu, Abhishek, Anand, Pratap


<b>7.get_dummies():</b> Returns a DataFrame of one-hot encoded values: each distinct string becomes a column that holds 1 in the rows where it occurs and 0 elsewhere.

In [32]:
# use on city
print(df["City"].str.get_dummies())

   Bangalore  Chennai  Delhi  Hyderabad  Kolkata  Mumbai
0          0        0      1          0        0       0
1          0        0      0          0        0       1
2          0        0      0          0        1       0
3          0        0      1          0        0       0
4          0        1      0          0        0       0
5          1        0      0          0        0       0
6          0        0      0          1        0       0


<b>8.startswith(pattern):</b> Returns True for each string in the Series that starts with the pattern; missing values stay NaN.

In [33]:
# give true if the names are start with G
print(df['Names'].str.startswith('G'))

0     True
1    False
2    False
3    False
4    False
5      NaN
6    False
Name: Names, dtype: object


<b>9.endswith(pattern):</b> Returns True for each string in the Series that ends with the pattern; missing values stay NaN.

In [34]:
#gives true if the names end with n
print(df["Names"].str.endswith("n"))

0     True
1    False
2    False
3    False
4    False
5      NaN
6    False
Name: Names, dtype: object
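
The NaN in these results matters when the mask is used to filter rows, because Boolean indexing cannot use a mask that contains missing values. Passing `na=False` treats missing names as "no match". A sketch with a shortened version of the Names column:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"Names": ["Gulshan", "Shashank", np.nan, "Pratap"]})

# na=False makes the mask purely True/False, so it can index the DataFrame
starts_with_g = people["Names"].str.startswith("G", na=False)
print(starts_with_g.tolist())          # [True, False, False, False]

print(people[starts_with_g])
```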


<b>10.replace(a,b):</b> Replaces occurrences of a with b in each string. In the example below 'Gulshan' is replaced by 'Gaurav'.

In [35]:
print(df["Names"].str.replace("Gulshan","Gaurav"))

0      Gaurav
1    Shashank
2       Bablu
3    Abhishek
4       Anand
5         NaN
6      Pratap
Name: Names, dtype: object
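
str.replace() can also match a regular expression when `regex=True` is passed explicitly. The sketch below strips a trailing numeric code from made-up city strings; the values and the pattern are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values: a city name followed by a numeric code
codes = pd.Series(["Delhi-110001", "Mumbai-400001"])

# Remove a hyphen followed by digits at the end of each string
cities = codes.str.replace(r"-\d+$", "", regex=True)
print(cities.tolist())  # ['Delhi', 'Mumbai']
```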


<b>11.repeat(value):</b> Repeats each string the given number of times. In the example below each string appears twice.

In [36]:
print(df["Names"].str.repeat(2))

0      GulshanGulshan
1    ShashankShashank
2          BabluBablu
3    AbhishekAbhishek
4          AnandAnand
5                 NaN
6        PratapPratap
Name: Names, dtype: object


<b>12.count(pattern):</b> Returns the number of times the pattern occurs in each string. The example below counts the occurrences of 'a' in each name; the match is case-sensitive, so the capital 'A' in 'Anand' is not counted.

In [37]:
print(df['Names'].str.count('a'))

0    1.0
1    2.0
2    1.0
3    0.0
4    1.0
5    NaN
6    2.0
Name: Names, dtype: float64


<b>13.find(pattern):</b> Returns the position of the first occurrence of the pattern in each string, or -1 if the pattern is not found.<br>
The example below returns the index of the first 'a' in each name.

In [38]:
print(df)
print(df['Names'].str.find('a'))

      Names       City   Split_Names
0   Gulshan      Delhi    [Gulsh, n]
1  Shashank     Mumbai  [Sh, sh, nk]
2     Bablu    Kolkata      [B, blu]
3  Abhishek      Delhi    [Abhishek]
4     Anand    Chennai      [An, nd]
5       NaN  Bangalore           NaN
6    Pratap  Hyderabad    [Pr, t, p]
0    5.0
1    2.0
2    1.0
3   -1.0
4    2.0
5    NaN
6    2.0
Name: Names, dtype: float64


<b>14.findall(pattern):</b> Returns a list of all occurrences of the pattern in each string. In the example below, 'Gulshan' yields a list with a single 'a' because the letter appears only once in that string.

In [39]:
import pandas as pd
import numpy as np

data = {'Names': ['Gulshan', 'Shashank', 'Bablu', 'Abhishek', 'Anand', np.nan, 'Pratap'],
        'City': ['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Chennai', 'Bangalore', 'Hyderabad']}

df = pd.DataFrame(data)
print(df)
print(df["Names"].str.findall("a"))

      Names       City
0   Gulshan      Delhi
1  Shashank     Mumbai
2     Bablu    Kolkata
3  Abhishek      Delhi
4     Anand    Chennai
5       NaN  Bangalore
6    Pratap  Hyderabad
0       [a]
1    [a, a]
2       [a]
3        []
4       [a]
5       NaN
6    [a, a]
Name: Names, dtype: object


<b>15.islower():</b> Checks whether all cased characters in each string of the Series are lowercase and returns a Boolean value.

In [40]:
print(df["Names"].str.islower())

0    False
1    False
2    False
3    False
4    False
5      NaN
6    False
Name: Names, dtype: object


<b>16.isupper():</b> Checks whether all cased characters in each string of the Series are uppercase and returns a Boolean value.

In [41]:
print(df["Names"].str.isupper())

0    False
1    False
2    False
3    False
4    False
5      NaN
6    False
Name: Names, dtype: object


<b>17.isnumeric():</b> Checks whether all characters in each string of the Series are numeric and returns a Boolean value.

In [42]:
print(df["Names"].str.isnumeric())

0    False
1    False
2    False
3    False
4    False
5      NaN
6    False
Name: Names, dtype: object
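
The three checks above return False for every name because the names are capitalized words. The sketch below applies them to made-up strings for which the results differ.

```python
import pandas as pd

# Hypothetical values chosen so that each check has a True case
values = pd.Series(["123", "12a", "abc", "ABC"])

numeric_mask = values.str.isnumeric()
lower_mask = values.str.islower()
upper_mask = values.str.isupper()

print(numeric_mask.tolist())  # [True, False, False, False]
print(lower_mask.tolist())    # [False, True, True, False]
print(upper_mask.tolist())    # [False, False, False, True]
```

Note that "123" is neither lower nor upper case, since it contains no cased characters.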


<b>18.swapcase():</b> Swaps the case of every character: uppercase characters become lowercase and lowercase characters become uppercase.

In [43]:
print(df["Names"].str.swapcase())

0     gULSHAN
1    sHASHANK
2       bABLU
3    aBHISHEK
4       aNAND
5         NaN
6      pRATAP
Name: Names, dtype: object


# Pandas: Detect Mixed Data Types and Fix it
* What are mixed types in Pandas columns?
  => A column of a Pandas data frame that does not hold a single type of data (only numeric or only string) but a mixture, for example both numbers and strings, is called a mixed data type column.<br>
  For example:<br>
   data_frame = pd.DataFrame( [['tom', 10], ['nick', '15'], ['juli', 14.8]], columns=['Name', 'Age'])<br>
* Causes of mixed data types<br>
<b>Missing Values (NaN)<br>
Inconsistent Formatting<br>
Data Entry Errors</b><br>
<h1>Identifying mixed types in Pandas columns:</h1>
To detect mixed data types, traverse each column of the Pandas data frame and get its inferred type using the<br> <b>pd.api.types.infer_dtype() function.</b><br>
Syntax:<br>
<b>for column in data_frame.columns:<br>
   print(pd.api.types.infer_dtype(data_frame[column]))</b><br>
Here,<br> 
data_frame: It is the Pandas data frame for which you want to detect if it has mixed data types or not.

In [44]:
#example: Python program to detect mixed data types in Pandas data frame.
import pandas as pd
# create dataframe
data_frame = pd.DataFrame([["Tom",12],["Nick","10"],["Bob",15.5]],columns=["Name","Age"])
print(data_frame)
# detect mixed data
for column in data_frame.columns:
    print(column,":",pd.api.types.infer_dtype(data_frame[column]))

   Name   Age
0   Tom    12
1  Nick    10
2   Bob  15.5
Name : string
Age : mixed-integer
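
Another way to spot a mixed column is to count the distinct Python types of its values. `mixed_columns` below is a hypothetical helper written for this sketch, not part of pandas; it is shown as a complement to infer_dtype().

```python
import pandas as pd

def mixed_columns(frame):
    """Return the names of columns whose non-missing values have more than one Python type."""
    return [col for col in frame.columns
            if frame[col].dropna().map(type).nunique() > 1]

data_frame = pd.DataFrame([["Tom", 12], ["Nick", "10"], ["Bob", 15.5]],
                          columns=["Name", "Age"])

# 'Age' holds an int, a str and a float, so it is reported as mixed
print(mixed_columns(data_frame))  # ['Age']
```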


<h1>Dealing with mixed types in Pandas columns:</h1>
To fix mixed data types in a Pandas data frame, convert the entire column to one data type.<br>
This can be done using the <b>astype()</b> function or the <b>to_numeric()</b> function.

<h3>Using astype() function:</h3>
Syntax:<br>
<b>data_frame[column] = data_frame[column].astype(int)</b><br>
Here,<br> 
<b>data_frame:</b> It is the Pandas data frame for which you want to fix mixed data types.<br>
<b>column:</b> The name of the column to convert.<br>
<b>int:</b> The data type to which the column is converted. You can also use str, float, etc., depending on the type you need. 

In [45]:
# Convert the mixed-type column to a single data type
# (note: astype(int) truncates 15.5 to 15)
data_frame["Age"]=data_frame["Age"].astype(int)
# Traverse the data frame to check the data types after the fix
for column in data_frame.columns:
    print(column,":",pd.api.types.infer_dtype(data_frame[column]))
print(data_frame)

Name : string
Age : integer
   Name  Age
0   Tom   12
1  Nick   10
2   Bob   15


<h3>Using to_numeric() function:</h3>
The to_numeric() function converts its argument to a numeric data type. This section shows how to fix mixed data types with it.<br>
Syntax:<br>
<b>data_frame[column] = data_frame[column].apply(lambda x: pd.to_numeric(x, errors = 'ignore'))</b><br>
Here,<br>
<b>data_frame:</b> It is the Pandas data frame for which you want to fix mixed data types.<br>
<b>column:</b> The name of the column to convert.<br>
Note: errors='ignore' is deprecated in recent pandas releases, which is why the example that follows catches the exception explicitly.

In [46]:
# example:
import pandas as pd 
data_frame2 = pd.DataFrame([["Alice",33],["Libi","44"],["Jimi",42.5]],columns=["Name","Age"])
print("before test:\n",data_frame2)
# check mixed data type
for column in data_frame2.columns:
    print(column,":",pd.api.types.infer_dtype(data_frame2[column]))

# Fix the problem with pd.to_numeric().
# apply(lambda x: pd.to_numeric(x, errors="ignore")) can work, but it is more robust
# to call pd.to_numeric without 'errors' and handle exceptions explicitly.
def safe_convert(x):
    try:
        return pd.to_numeric(x)
    except Exception:
        return x
# now use this function on the apply        
data_frame2["Age"] = data_frame2["Age"].apply(safe_convert)
# check again:
for column in data_frame2.columns:
    print(column,":",pd.api.types.infer_dtype(data_frame2[column]))
# final show
print("\nafter fixed:\n",data_frame2)

before test:
     Name   Age
0  Alice    33
1   Libi    44
2   Jimi  42.5
Name : string
Age : mixed-integer
Name : string
Age : floating

after fixed:
     Name   Age
0  Alice  33.0
1   Libi  44.0
2   Jimi  42.5
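
When a column may contain values that are not numbers at all, another option is `errors="coerce"`, which converts the whole column in one call and turns unparseable entries into NaN. The value "unknown" below is a made-up example of such an entry.

```python
import pandas as pd

# Hypothetical mixed column with one entry that is not a number
ages = pd.Series([33, "44", 42.5, "unknown"])

# Convert the whole Series at once; anything unparseable becomes NaN
numeric_ages = pd.to_numeric(ages, errors="coerce")

print(numeric_ages.dtype)            # float64
print(numeric_ages.isna().tolist())  # [False, False, False, True]
```

The resulting NaN can then be handled with fillna() or dropna() as shown earlier.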


<b>Conclusion:</b>
Pandas columns with mixed types can cause problems when analyzing data, but they can be found and resolved using the techniques in this article.<br>
By properly cleaning and preparing the data, data scientists and software developers can make their analysis more accurate and dependable.