# **Mixed Data (Numbers, Strings, and Mixed Types)**



**Learning objectives:**


In this chapter you will learn the following topics:

- Extracting values of different data types (strings, integers, floats, and dates) from raw, mixed data.

**Technical Requirements**

This chapter uses the pandas library, so you should already be familiar with its basics.

__Introduction__

Raw data are mostly unstructured or unformatted, and extracting useful information from them is a tedious task.

In this section, you will learn how to extract useful data from raw data using regular expressions.

For a quick reference while testing regular expressions (regex for short), see this [RegEx cheat sheet](https://ryanstutorials.net/regular-expressions-tutorial/regular-expressions-cheat-sheet.php).

Before implementing anything, let's first look at what a regular expression is.

__Regular Expression__

A regex, or regular expression, is a sequence of characters and special metacharacters used
to match a set of character strings. Regular expressions let you be more expressive
with string-matching operations than a simple substring search allows. You can think of a regex
as a "pattern" that you want to match against strings of different lengths, made up of different
characters.

**What are those metacharacters?**


Here is a list of basic metacharacters, and what they do:

- ".": The period is a metacharacter that matches any character other than a
newline.
- "[ ]": Square brackets specify a set of characters to match.
- "( )": Parentheses in regular expressions are used for grouping and to enforce
the proper order of operations, just as they are used in math and logical
expressions.
- "*": An asterisk matches 0 or more copies of the preceding element (a character, set, or group).
- "?": A question mark matches 0 or 1 copy of the preceding element.
- "+": A plus sign matches 1 or more copies of the preceding element.
- "{ }": Curly braces match the preceding element a specified number of
times.
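The metacharacters listed above can be tried out with Python's built-in `re` module. The following is a minimal sketch; the helper `matches` and the sample strings are made up for illustration and are not part of the chapter's dataset.

```python
import re


def matches(pattern, text):
    """Return the full text of every non-overlapping match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


dot = matches(r"c.t", "cat cot c\nt cut")        # "." does not match the newline
char_set = matches(r"[bc]at", "bat cat rat")     # one character from the set
group = matches(r"(ab)+", "ab abab")             # "+" repeats the whole group
star = matches(r"ab*", "a ab abbb")              # zero or more "b"
optional = matches(r"colou?r", "color colour")   # zero or one "u"
plus = matches(r"ab+", "a ab abbb")              # one or more "b"
braces = matches(r"\d{4}", "7 2020 123")         # exactly four digits

for name, found in [("dot", dot), ("char_set", char_set), ("group", group),
                    ("star", star), ("optional", optional), ("plus", plus),
                    ("braces", braces)]:
    print(name, found)
```

Comparing `star` and `plus` on the same text shows the difference between "0 or more" and "1 or more": only `ab*` accepts the lone `a`.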

Inside square brackets, a hyphen defines a range of characters, which allows us to quickly specify
certain common character types. Common ranges include:


- [a-z]: Match any lowercase letter.
- [A-Z]: Match any uppercase letter.
- [0-9]: Match any digit.
- [a-zA-Z0-9]: Match any letter or digit.
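As a rough illustration of these character sets, the sketch below applies each one to a single sample string modeled on the rows used later in this chapter. The variable names are illustrative only.

```python
import re

text = "Shyam 1 2019-03-11"

lowercase_runs = re.findall(r"[a-z]+", text)      # runs of lowercase letters
uppercase = re.findall(r"[A-Z]", text)            # single uppercase letters
digit_runs = re.findall(r"[0-9]+", text)          # runs of digits
alnum_runs = re.findall(r"[a-zA-Z0-9]+", text)    # runs of letters or digits

print(lowercase_runs)
print(uppercase)
print(digit_runs)
print(alnum_runs)
```

Note that `[a-z]+` skips the capital `S`, so a name pattern usually needs to combine an uppercase set with a lowercase one.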

Python regular expressions also include shorthand sequences for common
character classes:

- \d: Match any digit.
- \D: Match any non-digit.
- \w: Match a word character (a letter, digit, or underscore).
- \W: Match a non-word character.
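The shorthand sequences can be compared side by side on one string. This is a small sketch with an invented sample value; raw strings (`r"..."`) are used so that Python passes the backslashes through to the regex engine unchanged.

```python
import re

text = "Hari 1 4.4"

digits = re.findall(r"\d", text)          # every single digit
non_digits = re.findall(r"\D+", text)     # runs of anything that is not a digit
words = re.findall(r"\w+", text)          # runs of letters, digits, or underscores
non_words = re.findall(r"\W+", text)      # runs of spaces and punctuation

print(digits)
print(non_digits)
print(words)
print(non_words)
```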

For more information and examples, see the [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html).


**Example:**

In [0]:
# Create a dataframe with a single column of strings
import pandas as pd
data = {'raw': ['Ram 1 2020-02-04     125.1',
                'Shyam 1 2019-03-11       65121.7',
                'Rupa 0 2019-04-12       456.2',
                'Sita 0 2019-12-01      445.6',
                'Hari 1 2020-01-03       4.4',
                'Gita 0 2020-07-28       245.6']}
df = pd.DataFrame(data, columns = ['raw'])
df

| | raw |
|---|---|
| 0 | Ram 1 2020-02-04 125.1 |
| 1 | Shyam 1 2019-03-11 65121.7 |
| 2 | Rupa 0 2019-04-12 456.2 |
| 3 | Sita 0 2019-12-01 445.6 |
| 4 | Hari 1 2020-01-03 4.4 |
| 5 | Gita 0 2020-07-28 245.6 |


*The raw column mixes several kinds of data in a single string: a name (string), a categorical flag (integer), a date, and a score (float). We are going to extract each kind into its own column.*

In [0]:
# From the 'raw' column, extract the date (YYYY-MM-DD) into a new 'date' column
df['date'] = df['raw'].str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False)

_Extract the date from column raw into the date column._
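How specific the date pattern is matters once the text contains other hyphenated tokens. The sketch below compares a pattern built only from dots with one built from `\d` and curly braces; the two sample strings are hypothetical and chosen to show the difference, and they also show that `str.extract` returns a missing value when nothing matches.

```python
import pandas as pd

# Hypothetical rows: one with a hyphenated token before the date, one with no date
sample = pd.Series(["item-no-42 2020-02-04", "no date here"])

# "." matches any character, so the first hyphenated token is captured instead
loose = sample.str.extract(r"(....-..-..)", expand=False)

# \d{4}-\d{2}-\d{2} only accepts digits in the year, month, and day positions
strict = sample.str.extract(r"(\d{4}-\d{2}-\d{2})", expand=False)

print(loose.tolist())
print(strict.tolist())
```

On the chapter's own data both patterns return the same dates, because each row contains only one hyphenated token.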

In [0]:
# From the 'raw' column, extract the float value into a new 'score' column.
# The optional sign sits inside the capture group so it is kept with the number.
df['score'] = df['raw'].str.extract(r'([+-]?\d+\.\d+)', expand=False)

_Extract float values from column raw._
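Where the optional sign sits relative to the capture group decides whether it survives extraction. The following sketch uses a hypothetical row with a negative score (the chapter's data contains none) to compare a pattern with the sign outside the group against one with the sign inside it.

```python
import re

# Hypothetical row with a negative score, not part of the chapter's dataset
row = "Sita 0 2019-12-01 -445.6"

sign_outside_group = r"[+-]?(\-?\d+\.\d+)"   # the leading [+-]? consumes the sign
sign_inside_group = r"([+-]?\d+\.\d+)"       # the sign is part of the captured text

outside = re.search(sign_outside_group, row).group(1)
inside = re.search(sign_inside_group, row).group(1)

print(outside)
print(inside)
print(float(inside))
```

With the sign outside the group, the minus is matched but not captured, so the value would silently become positive after conversion to float.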

In [0]:
# From the 'raw' column, extract the categorical integer value into a new 'gender' column.
# (\d) captures the first digit that appears in each string.
df['gender'] = df['raw'].str.extract(r'(\d)', expand=False)

_Extract the categorical value from column raw._

In [0]:
# From the 'raw' column, extract the capitalized word into a new 'Name' column
df['Name'] = df['raw'].str.extract(r'([A-Z]\w*)', expand=False)

_Extract string values from column raw._
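The four separate extractions could also be combined into one pattern. This is a sketch of an alternative, not the approach used in this chapter: it assumes every row has the layout name, flag, date, score separated by whitespace, and it relies on `str.extract` turning named groups (`(?P<name>...)`) into column names.

```python
import pandas as pd

raw = pd.Series(["Ram 1 2020-02-04     125.1",
                 "Shyam 1 2019-03-11       65121.7"])

# One named group per field; \s+ absorbs the variable amount of whitespace
pattern = (
    r"(?P<Name>[A-Z]\w*)\s+"
    r"(?P<gender>\d)\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<score>[+-]?\d+\.\d+)"
)

parts = raw.str.extract(pattern)   # returns a DataFrame with one column per group
print(parts)
```

Because the fields are anchored to each other, a row that deviates from the assumed layout yields missing values in every column, which makes malformed rows easier to spot.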

In [0]:
df.head(2)

| | raw | date | score | gender | Name |
|---|---|---|---|---|---|
| 0 | Ram 1 2020-02-04 125.1 | 2020-02-04 | 125.1 | 1 | Ram |
| 1 | Shyam 1 2019-03-11 65121.7 | 2019-03-11 | 65121.7 | 1 | Shyam |


_Drop the column raw._

In [0]:
df = df.drop(columns='raw')  # drop the raw column
df.head(2)

| | date | score | gender | Name |
|---|---|---|---|---|
| 0 | 2020-02-04 | 125.1 | 1 | Ram |
| 1 | 2019-03-11 | 65121.7 | 1 | Shyam |


*We successfully extracted every kind of value from the given raw data.*

*Let's see the data type of each column in the newly created dataframe.*

In [0]:
df.dtypes  # show the data type of each column

date      object
score     object
gender    object
Name      object
dtype: object

*Every column is still of type object because str.extract returns strings. Convert each column to an appropriate data type.*

In [0]:
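# Use bracket notation when assigning columns; attribute access can clash with DataFrame methods
df['date'] = df['date'].astype('datetime64[ns]')
df['score'] = df['score'].astype('float')
df['gender'] = df['gender'].astype('category')
df['Name'] = df['Name'].astype('str')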

In [0]:
df.dtypes # show datatype

date      datetime64[ns]
score            float64
gender          category
Name              object
dtype: object
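The `astype` conversion raises an error if any extracted value cannot be parsed. When some rows may have failed to match, one possible alternative is `pd.to_datetime` and `pd.to_numeric` with `errors="coerce"`, which turn unparseable values into missing values instead. The small dataframe below is hypothetical and only simulates such failures.

```python
import pandas as pd

# Hypothetical extraction result: the second row has no date and a non-numeric score
extracted = pd.DataFrame({"date": ["2020-02-04", None],
                          "score": ["125.1", "n/a"]})

# errors="coerce" replaces values that cannot be parsed with NaT / NaN
extracted["date"] = pd.to_datetime(extracted["date"], errors="coerce")
extracted["score"] = pd.to_numeric(extracted["score"], errors="coerce")

print(extracted)
print(extracted.dtypes)
```

Counting the missing values afterwards, for example with `extracted.isna().sum()`, gives a quick check of how many rows did not follow the expected format.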

# **Key takeaway**

* Regular expressions (regex) are used to extract meaningful data from raw text data.

