# **Dirty Data Diagnosis Workflow Guide - Numerical**

### Welcome to the Workflow Guide for Diagnosing Dirty Numerical Data!

In this workflow guide, we will explore how we can diagnose dirty numerical data using datalabx.

### **Importing Libraries**

To begin with, we will be importing:

- datalabx

In [1]:
import datalabx

### **Loading the Data**

In this guide, we will use a synthetic but realistic numerical dataset that is extremely dirty.

In [2]:
from datalabx import load_tabular

df = load_tabular('ultra_messy_dataset.csv')
df.head()

Unnamed: 0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
0,five,1.04e+05,4735.244618878169,"1,55e+02",,unknown,,30.70,4.888018131992931,64972
1,20.673408544550288,134460.66741276794,9533.158420128186,"1,53e+02",12193,2.05e+01,2400.0,one,"3,33e+00","$58,276.81"
2,33.56,missing,4104.95,204,96.50051410846052,"-1,14e+01",2877.0871437672777,four,unknown,94033.3425007563
3,missing,,894429.0,196.34 cm,?,-1.57e+01,,?,4.31,"27,400cm"
4,28$,112877.67251785153,1040.45,171.40 cm,4912,-6,4344.39,2.24,4.14,67682


### **Checking Datatypes**

This step is quite important because:

- Beginning with **v0.1.0b4**, datalabx loads every column of a pandas DataFrame with the **string** datatype.
- Loading everything as strings makes dirty data detection easier and more consistent to work with.
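
As a rough illustration of why string loading helps, the sketch below uses only Python's standard `csv` module, not datalabx. The two-row sample is hypothetical and is built from values shown earlier in this guide. Because every cell stays a string, details such as a comma inside a number or a unit suffix survive loading and remain available for diagnosis.

```python
import csv
import io

# Hypothetical two-row sample built from values shown in this guide.
raw = 'Age,Height_cm\nfive,"1,55e+02"\n28$,171.40 cm\n'

# csv.DictReader never infers types: every cell comes back as a str.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    print({column: repr(value) for column, value in row.items()})

# True when no cell was converted to a number on the way in.
all_strings = all(isinstance(v, str) for row in rows for v in row.values())
```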

In [3]:
df.dtypes

Age                object
Salary             object
Expenses           object
Height_cm          object
Weight_kg          object
Temperature_C      object
Purchase_Amount    object
Score              object
Rating             object
Debt               object
dtype: object

Great! 

We can see that every column holds strings, which pandas reports here as the `object` dtype.

### **Diagnosing Dirty Data**

Using datalabx, we can explore several kinds of inconsistencies and formatting issues in a dataset, whether or not they are visible at first glance.

To explore these issues, we import the ``DirtyDataDiagnosis`` class.

In [4]:
from datalabx import DirtyDataDiagnosis

#### **Diagnosing Dirty Numerical Data**

datalabx lets us diagnose numerical data separately, using the ``diagnose_numbers()`` method of the **DirtyDataDiagnosis** class.

To list every diagnostic method that ``diagnose_numbers()`` provides, we pass **show_available_methods=True**.


In [5]:
diagnosis = DirtyDataDiagnosis(df).diagnose_numbers(show_available_methods=True)

DirtyDataDiagnosis - INFO - Dirty Data Diagnosis initialized with auto backend.
DirtyDataDiagnosis - INFO - Available diagnostic methods: ['is_valid', 'is_dirty', 'is_text', 'is_symbol', 'is_missing', 'is_scientific_notation', 'has_units', 'has_symbols', 'has_commas', 'has_currency', 'has_multiple_decimals', 'has_multiple_commas', 'has_spaces', 'has_decimals', 'has_text']


With **show_available_methods=True**, the log prints the full list of diagnostics, which is helpful for reference.




#### **Exploring Dirty Numerical Data**

All Dirty Data Diagnosis methods are accessed with the same pattern:

``diagnosis[column_name][diagnostic_method]``

Let us look at an example of that:


In [6]:
diagnosis['Age']['is_dirty'].head(5)

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,five,1.04e+05,4735.244618878169,"1,55e+02",,unknown,,30.70,4.888018131992931,64972
3,missing,,894429,196.34 cm,?,-1.57e+01,,?,4.31,"27,400cm"
4,28$,112877.67251785153,1040.45,171.40 cm,4912,-6,4344.39,2.24,4.14,67682
5,,"97,011cm",3940.648823758793,193,72.02,approx 1000,4680.0,?,1.4870053742777882,
6,"1,62e+00",,?,188.38,13W,-17.10,4939.82,"1,37e+01",approx 1000,4997988


In the above example, we get a DataFrame of the rows whose values are dirty in the selected column, together with the corresponding values in the other columns.
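
Since every result is reached through the same two keys, it is easy to loop over several diagnostics at once. The sketch below is a minimal example that assumes each `diagnosis[column][method]` result supports `len()`, as a pandas DataFrame does. The helper name `count_flagged` is hypothetical, and a plain dictionary of lists stands in for a real diagnosis object so that the snippet runs on its own.

```python
def count_flagged(diagnosis, column, methods):
    """Return {method: number of flagged rows} for one column."""
    return {method: len(diagnosis[column][method]) for method in methods}


# Stand-in for a real diagnosis object: plain lists replace the DataFrames.
fake_diagnosis = {"Age": {"is_dirty": ["five", "28$"], "is_missing": [None]}}

counts = count_flagged(fake_diagnosis, "Age", ["is_dirty", "is_missing"])
print(counts)
```

With a real diagnosis, the same helper could be called with the method names printed by **show_available_methods=True**.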

We will now explore all the available diagnostic methods to see how datalabx helps in detecting dirty data.

##### 1. **is_valid** -> Numbers containing only valid integers or decimals

In [7]:
diagnosis['Salary']['is_valid']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,20.673408544550288,134460.66741276794,9533.158420128186,"1,53e+02",12193,2.05e+01,2.40e+03,one,"3,33e+00","$58,276.81"
8,19.668785230642992,113743,,209cm,"7,41e+01",-047,,approx 1000,1.76,7009614
16,approx 1000,14323039,669239,200.21752547959585,,-5.95023300841531,2805.74,53,3.55e+00,60.441
17,80.26378710785808,37520,6196,17096,8e,error,108278,7715,2.0633347862257705,"37,569$"
20,31.76114435883044,84341.40047685153,375313,157.54 cm,45.687369632176015,,two,unknown,193,one
...,...,...,...,...,...,...,...,...,...,...
99975,?,60113.35587059199,"5,927#",?,,2.60e+01,4837,2822,1.7935389974553844,"2,18e+04"
99985,5.31e+01,41.000,?,,,five,40847,24.45,,?
99986,86,91419.30329171919,"$1,552.59",one,unknown,-1440,"USD4,000.28",60.42,415,"6,054$"
99996,4.67e+01,44685,8926.38,169.24 cm,?,-1.87,3.57e+03,"1,53e+01",3.06,one


We can see:

- Rows containing valid data in the selected column **'Salary'**, along with the corresponding values in the other columns.

- That in the selected column **'Salary'** at **row 99985**, we have **41.000**.

A value can pass as valid and still be ambiguous: ``41.000`` may mean 41, or 41,000 if the period is a thousands separator. Converting such data directly into numbers may therefore not be the right choice, and it requires careful cleaning.

However, in this guide we are only exploring how we can diagnose such hostile data with datalabx.
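
To make the ambiguity of a value like ``41.000`` concrete, here is a small plain-Python sketch (not datalabx) showing two plausible readings. Which reading is right depends on where the data came from, so neither is assumed to be correct.

```python
value = "41.000"

# Reading 1: the period is a decimal point.
as_decimal = float(value)

# Reading 2: the period is a thousands separator, as in some locales.
as_thousands = float(value.replace(".", ""))

print(as_decimal, as_thousands)
```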

##### 2. **is_dirty** -> Values that are not strictly numeric

In [8]:
diagnosis['Salary']['is_dirty']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,five,1.04e+05,4735.244618878169,"1,55e+02",,unknown,,30.70,4.888018131992931,64972
2,33.56,missing,4104.95,204,96.50051410846052,"-1,14e+01",2877.0871437672777,four,unknown,94033.3425007563
4,28$,112877.67251785153,1040.45,171.40 cm,4912,-6,4344.39,2.24,4.14,67682
5,,"97,011cm",3940.648823758793,193,72.02,approx 1000,4.68e+03,?,1.4870053742777882,
7,4748,12235546,,206.76284108092278,unknown,3061,-3813.6973731160588,10$,,
...,...,...,...,...,...,...,...,...,...,...
99992,17cm,79100.25,7126.922194644891,"1,63e+02",56,five,approx 1000,,3.7461261651622686,"6,42e+04"
99993,unknown,one,519kg,168.45,1.24e+02,13.27 C,2676,?,,48997
99995,58.46412848279316,17167006,4.74e+03,error,?,-1803,3K60,40,1.18,4799391
99998,73.77945243627492,"134,820cm",,,124,49.24252868950967,53,92.33288196735303,2.57,56363.25637056529


We can see:

- Rows containing dirty data in the selected column **'Salary'**, along with the corresponding values in the other columns.

- At **rows 4 and 99992**, we have values that look like valid numbers, so why are they detected as dirty?

Well, we will be exploring that next.

##### 3. **has_spaces** -> Numbers with leading or trailing spaces

In [9]:
diagnosis['Salary']['has_spaces']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
4,28$,112877.67251785153,1040.45,171.40 cm,4912,-6,4344.39,2.24,4.14,67682
12,56.54658961646852,117093.88,280,209.39 cm,10830,4710,39,?,,missing
24,14.804404681617523,163626.39,,one,,0.15,2594.3002316877773,2,3,47772
26,,177863.96,6907.200431442799,192.04280102035187,,unknown,,,1$,"23,704cm"
29,,150576.34139633976,9701.922363121375,181.51512589672436,13380,-1378,1489,3.5289259277817897,3.85,74363.81
...,...,...,...,...,...,...,...,...,...,...
99968,077,159898.19,,19630,three,49.36,1.01e+03,error,,missing
99976,,149996.29,four,149,13483,42,"2,278$",54.77,?,3445133
99983,17.36,72039.32150316115,3166.4743189371625,159.47231001453957,62,26,"4,846#",,2.60,54826
99989,one,85135.0577649094,"£8,970.84",154.68503873178864,four,,86498,84.9798232754922,,6057819


We can see:

- That **rows 4 and 99992** were detected as **dirty** because they had spaces, even though they were valid numbers.

Let us verify that by checking representation of data in column **Salary**.

In [10]:
spaces_in_salary_column = diagnosis['Salary']['has_spaces']

spaces_in_salary_column['Salary'].apply(repr)

index
4        '  112877.67251785153  '
12                   '117093.88 '
24                   '163626.39 '
26                   '177863.96 '
29       '  150576.34139633976  '
                   ...           
99968                '159898.19 '
99976                '149996.29 '
99983     '  72039.32150316115  '
99989      '  85135.0577649094  '
99992                 '79100.25 '
Name: Salary, Length: 12461, dtype: object

We can verify:

- That we have leading and trailing spaces in our data.
- Valid numbers containing dirt were correctly identified as **'dirty'**, to prevent errors and failures.
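
The same check can be approximated in plain Python by comparing a string with its stripped version. This is only a sketch of the idea, not how datalabx implements `has_spaces`, and the helper name is hypothetical.

```python
def has_outer_spaces(value: str) -> bool:
    """True when the string starts or ends with whitespace."""
    return value != value.strip()


samples = ["  112877.67251785153  ", "117093.88 ", "79100.25"]
flags = [has_outer_spaces(s) for s in samples]
print(flags)
```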

##### 4. **is_text** -> Values containing only alphabetic characters and spaces

In [11]:
diagnosis['Salary']['is_text']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2,33.56,missing,4104.95,204,96.50051410846052,"-1,14e+01",2877.0871437672777,four,unknown,94033.3425007563
13,?,unknown,"USD1,968.25",140,84.77 kg,-13.963460658813508,2615,70.0010216186228,201,528H1
19,24,error,missing,154.39347845753835,1d1,-11.799573772015473,four,"8,65e+01",2.72,99873.53508502501
21,10.63,missing,"4,858$",158,131.57 kg,,322779,2.7084178606550324,approx 1000,6889551
41,"3,47e+01",one,7849.991870917486,?,146,"-1,23e+01",2470,five,4.74,?
...,...,...,...,...,...,...,...,...,...,...
99945,65,error,7733.017450079798,183,,43.98 C,3296.8437727603455,"4,68e+01",3.4208849180608993,"4,50e+03"
99949,8.70,five,12m6,approx 1000,4.94e+01,3,3879.643378469343,81.50471115201124,,92682.22547475726
99963,approx 1000,five,4.651,179.4574667294149,114.48668243994786,4445,"4,129cm",69.50,3,"£81,550.43"
99977,30,four,8510.975305875856,,G6,,312300,4.84e+01,4.61,"20,083#"


We can see:

- Rows where the selected column contains only text.
- Presence of **numbers in the form of text**. E.g: ``['one', 'four', 'five']``
- Presence of **missing data placeholders**. E.g: ``['missing', 'unknown', 'error']``
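
A text-only check like this can be sketched with a regular expression that accepts only letters and spaces. The pattern below is an assumption based on the description of `is_text` above, not the library's actual rule.

```python
import re

# Letters and spaces only, matching the description of is_text above.
TEXT_ONLY = re.compile(r"[A-Za-z ]+")


def is_text_only(value: str) -> bool:
    """True when the whole string is made of letters and spaces."""
    return TEXT_ONLY.fullmatch(value) is not None


samples = ["missing", "one", "approx 1000", "112877.67", "?"]
flags = [is_text_only(s) for s in samples]
print(flags)
```

Note that ``approx 1000`` is not text-only because it contains digits, so it falls under a different diagnostic.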




##### 5. **is_symbol** -> Values that are only symbols **(anything not text or number).**

In [12]:
diagnosis['Expenses']['is_symbol']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
6,"1,62e+00",,?,188.38,13W,-17.10,4939.82,"1,37e+01",approx 1000,4997988
22,51,9.19e+04,?,19414,82.92 kg,1.6457943767645133,$395.47,48.849217370187525,277,61514.218112467155
25,8.90e+01,158192,?,151,four,four,"£2,337.09",,-3.3620502805540093,69092
85,approx 1000,"67,169cm",?,199.07271353242862,12266,30.77,5528,40.49,,6838.111785109047
87,"5,34e+01",3077053,?,,,42.95 C,382.21031260890965,96.74,G,
...,...,...,...,...,...,...,...,...,...,...
99942,,119212,?,error,48kg,,4572.038038420724,,4,4834642
99952,four,approx 1000,?,185.46537031537326,,-14.319478982819899,"4,25e+03",five,4.66,51951
99978,12.29,?,?,1.59e+02,12947,,1741.6447996989257,15.464122066700325,2,1768431
99985,5.31e+01,41.000,?,,,five,40847,24.45,,?


We can see:

- Rows where the selected column contains only symbols.
- Presence of symbols like **'?'**.
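
One way to approximate a symbol-only check is to reject any string that contains a letter, a digit, or whitespace. The pattern is an assumption drawn from the "anything not text or number" description, and the helper name is hypothetical.

```python
import re

# One or more characters that are not letters, digits, or whitespace.
SYMBOL_ONLY = re.compile(r"[^A-Za-z0-9\s]+")


def is_symbol_only(value: str) -> bool:
    """True when the whole string consists of symbol characters."""
    return SYMBOL_ONLY.fullmatch(value) is not None


samples = ["?", "#", "5,927#", "41.000", "missing"]
flags = [is_symbol_only(s) for s in samples]
print(flags)
```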

##### 6. **is_scientific_notation** -> Numbers expressed in scientific notation

In [13]:
diagnosis['Salary']['is_scientific_notation']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,five,1.04e+05,4735.244618878169,"1,55e+02",,unknown,,30.70,4.888018131992931,64972
22,51,9.19e+04,?,19414,82.92 kg,1.6457943767645133,$395.47,48.849217370187525,277,61514.218112467155
48,?,6.34e+04,-3969.8980404480108,197.30,147$,14.587490438423828,363.54132112663615,90.95,480,34511.48105698926
83,53.81,"1,55e+05",,144,63$,1.07e+00,3X98,2.62,1.5582989345477651,"£18,067.59"
91,,"4,24e+04",6223.7532314998525,184.82,?,three,1.72e+03,77.33,3.0364374933598874,1698990
...,...,...,...,...,...,...,...,...,...,...
99916,1.76e+01,"1,83e+05",?,,approx 1000,33,2206,77cm,2.9472687090768925,"£66,390.88"
99934,30.53,"1,00e+05",,one,x26,-5.20507318092902,,65,one,77913.45969225027
99937,,3.50e+04,4921.773488771688,?,,43,£393.92,missing,1kg,"€29,603.02"
99981,5,"1,89e+05",183.95536021312876,"1,56e+02",missing,,673.6215011527486,25#,,"€1,760.30"


We can see:

- Rows containing values that are expressed in scientific notation, in the selected column.
- Scientific notation values that have decimal points. E.g: ``1.04e+05``
- Scientific notation values that have commas. E.g: ``1,55e+05``

**NOTE**:

> Scientific notation is not dirty numerical data in itself; it is simply a less common way of writing numbers.
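
For illustration, the sketch below recognizes both forms seen above and converts them with Python's built-in `float`. Treating the comma in ``1,55e+05`` as a decimal mark is an assumption made for this example, and the helper name is hypothetical.

```python
import re

# The mantissa may use '.' or ',' as the decimal mark; the exponent is required.
SCI = re.compile(r"[+-]?\d+(?:[.,]\d+)?[eE][+-]?\d+")


def parse_scientific(value: str):
    """Return a float for scientific-notation strings, else None."""
    if SCI.fullmatch(value) is None:
        return None
    # Assumption: a comma in the mantissa stands for a decimal point.
    return float(value.replace(",", "."))


parsed = [parse_scientific(v) for v in ["1.04e+05", "1,55e+05", "112877.67"]]
print(parsed)
```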

##### 7. **is_missing** -> Values that are pandas built-in missing values **(NaN, None or NaT)**

In [14]:
diagnosis['Salary']['is_missing']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
3,missing,,894429,196.34 cm,?,-1.57e+01,,?,4.31,"27,400cm"
6,"1,62e+00",,?,188.38,13W,-17.10,4939.82,"1,37e+01",approx 1000,4997988
23,53.17,,2375.26,?,4529,-3.60,three,,-1.5500259276486759,546.9687378848231
38,63.845197905025486,,3.39e+03,two,137cm,33,three,?,1.549877485863596,4.20e+02
40,4170,,-9163.295252169082,158.49341835731576,49.97026332576549,-5.339405691908919,341.7384991212715,6m,1.4979516300582563,
...,...,...,...,...,...,...,...,...,...,...
99967,1331,,6135.652593031973,149.35239389417842,"4,83e+01",,1726.3862998257182,?,?,X7977
99972,78.65500787712726,,65g2,185,70.26295135296155,-2.6470571433522068,,94,1.195301407321253,unknown
99979,1792,,4002.7337333716923,178#,"1,02e+02",-7.49 C,approx 1000,59,3.9279059376617296,?
99982,?,,4574.147201956746,?,,-6.743289763755037,513.3439717515614,13.55,335,"5,39e+04"


We can see:

- Rows containing pandas built-in missing values in the selected column.
- Presence of **None**.
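
Built-in missing values are different from placeholder strings such as ``'missing'`` or ``'unknown'``, which are ordinary text. The plain-Python sketch below illustrates the distinction for `None` and `NaN` only. It does not cover `NaT`, and it is an approximation rather than the library's implementation.

```python
import math


def is_builtin_missing(value) -> bool:
    """True for None or float NaN; placeholder strings are not missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


samples = [None, float("nan"), "missing", "unknown", ""]
flags = [is_builtin_missing(v) for v in samples]
print(flags)
```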

##### 8. **has_commas** -> Numbers that contain single commas.

In [15]:
diagnosis['Salary']['has_commas']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
7,4748,12235546,,206.76284108092278,unknown,3061,-3813.6973731160588,10$,,
14,89,19115589,6110.901653052778,20345,13296,49,1602.82,89.72625151867503,1,
44,37,19301884,"7,32e+03",146,111#,-29.004856251571162,?,56.17,,83693
47,?,17354524,one,206.86,cc,-1707,"2,86e+03",?,4.44,2630479
54,7.406576281928611,9919411,unknown,169.55197193077123,61.02,18.48512229150878,,,4,?
...,...,...,...,...,...,...,...,...,...,...
99950,,91256,7315.909551677584,186.40,?,23,4455,99kg,4,-24727.72113548849
99980,49.992466371566216,10557082,6.73e+03,two,5243,47.46,four,2$,2.7478660020037577,
99991,-73.86346439365516,17051439,,157kg,,16.48352001523763,1144,-52.52890051132507,3.05,1771974
99995,58.46412848279316,17167006,4.74e+03,error,?,-1803,3K60,40,1.18,4799391


We can see:

- Rows containing numbers with commas in the selected column.
- Commas being used as thousand separator. E.g: ``91,256``
- Commas being used as decimal replacement. E.g: `99194,11`
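
The two roles a comma can play may be told apart, at least roughly, by looking at the digit groups around it. The patterns below are heuristics assumed for this sketch: a comma followed by exactly three digits is read as a thousands separator, and a comma followed by one or two digits as a decimal replacement.

```python
import re

THOUSANDS = re.compile(r"[+-]?\d{1,3}(?:,\d{3})+")  # e.g. 91,256
DECIMAL_COMMA = re.compile(r"[+-]?\d+,\d{1,2}")     # e.g. 99194,11


def classify_comma(value: str) -> str:
    """Guess the role of the comma in a numeric string."""
    if THOUSANDS.fullmatch(value):
        return "thousands separator"
    if DECIMAL_COMMA.fullmatch(value):
        return "decimal replacement"
    return "unclear"


labels = {v: classify_comma(v) for v in ["91,256", "99194,11", "1,2345"]}
print(labels)
```

A heuristic like this cannot settle every case, so values labelled "unclear" still need a manual look.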

##### 9. **has_units** -> Numbers suffixed with alphabetical units

In [16]:
diagnosis['Salary']['has_units']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
5,,"97,011cm",3940.648823758793,193,72.02,approx 1000,4.68e+03,?,1.4870053742777882,
9,?,"114,154kg",3886.04358926301,,67cm,4.19e+01,,?,four,
11,5405,"84,570kg",3565,19178,90,three,4.01e+03,-46.50775056319312,,767558
18,7794,"167,059kg",3305.7506022186312,140.0933932430345,59,unknown,?,17.66,2,"€1,365.37"
85,approx 1000,"67,169cm",?,199.07271353242862,12266,30.77,5528,40.49,,6838.111785109047
...,...,...,...,...,...,...,...,...,...,...
99890,8652,11878g,7194,204.99880867126944,123.98636363182037,,274455,-55.50372598636834,279,"€85,266.94"
99926,,"174,203kg",5565,,,,404529,87.87463218920847,3.7374133373046017,8858810
99943,74.4200605300671,"42,646kg",494.5423936608717,,,-Q8,820.115047444307,,g,7875
99988,error,"54,060cm",unknown,195.1132301157852,5580,-1.45 C,269.98156504071824,82.24745040715497,4.0062028980917335,89986


We can see:

- Rows of numbers containing units like `'cm'`, `'kg'` or `'C'`.

- Presence of units in **Salary** column, which should not be the case.
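
Once such rows are found, it can help to separate the numeric part from the unit before deciding what to do with them. The regular expression below is a hedged sketch of that idea with a hypothetical helper name; it only handles a number followed by alphabetic characters.

```python
import re

# Digits, dots and commas, optional whitespace, then a purely alphabetic unit.
NUMBER_WITH_UNIT = re.compile(r"([+-]?[\d.,]+)\s*([A-Za-z]+)")


def split_unit(value: str):
    """Return (number_text, unit), or None when no trailing unit is found."""
    match = NUMBER_WITH_UNIT.fullmatch(value)
    return match.groups() if match else None


results = [split_unit(v) for v in ["97,011cm", "196.34 cm", "11878g", "1.04e+05"]]
print(results)
```

Scientific notation such as ``1.04e+05`` is not mistaken for a unit here, because the exponent digits follow the letter.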

##### 10. **has_symbols** -> Numbers containing non-alphanumeric or special symbols

In [17]:
diagnosis['Expenses']['has_symbols']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
72,unknown,157130,"1,584#",201,?,3.39,,two,3kg,"8,50e+04"
133,five,,"7,489#",191.1978074857144,error,2822,,,271,?
288,,174587.84885026535,613#,1.95e+02,40.66640021440021,-3.686203201619353,3469.1026930108605,50.66165359258081,approx 1000,4412Y
327,25.430754502321705,176320.83,"5,064#",approx 1000,42.96,-17.696365801547106,331480,65,1,"$47,926.99"
379,three,"$21,861.45","5,136#",,123.14 kg,,,,five,"£16,680.45"
...,...,...,...,...,...,...,...,...,...,...
99787,94.42,6696u,"2,437#",184kg,,48.176052012252626,"1,301cm",5471,1.05,
99796,53.591163687381105,2.70e+04,"8,440#",184.89,109.10,8,3860.263958816173,96.60,"1,76e+00",6.43e+03
99871,?,32672.15,"2,385#",?,81,36.77,,5,4,
99883,missing,47814.02,"2,442#",189.36,72.19,29,?,three,4,13874.71


We can see:

- Rows containing numbers with symbols in the selected column.

- Presence of symbols like ``','`` and ``#`` at the middle or end of numbers.

##### 11. **has_currency** -> Numbers containing currency symbols **(start or end)**

In [None]:
diagnosis['Salary']['has_currency']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
10,,"£174,199.87",4391.802671782378,185.76848542242396,14987,33.94610237955915,3.43e+03,five,2.46,"75,920cm"
31,30.509111667917633,"£56,122.72",775483,172,104.15 kg,4815,,4.79e+01,4.940423406622413,43176.67103678792
68,?,"€187,848.32",993$,five,60.907042770125216,46.41,?,81$,1.1427688707982004,
76,3,"€70,388.25",8.45e+03,,115.9444747337286,-16$,"4,91e+03",3751,2.27e+00,
103,97.72722836210654,"£183,864.54",9354.24,186.01437317034862,9h,35.30,-3295.56172482079,19.444189640075727,1.2066048806475855,61839.53758801937
...,...,...,...,...,...,...,...,...,...,...
99893,59.45,"101,638$",3952.90,error,86#,14.82357384827251,,,359,13961.242334451885
99927,?,"£148,827.03","3,92e+03",141.04,73.53,,?,two,1.3382789658845238,6913119
99935,90.47,"€103,742.85",one,154kg,five,-14kg,4.774,59,,3.53e+04
99960,5.00e+01,"€75,166.97",unknown,,55kg,"-3,97e+00",one,?,3.20,


We can see:

- Rows containing numbers with currency symbols in the selected column.
- Presence of currency symbols at beginning. E.g: **[£174,199.87, €187,848.32]**
- Presence of currency symbols at end. E.g: **(94,635$)**
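
To show what "start or end" means in practice, the sketch below reports where a currency symbol sits in a string. The symbol set is an assumption limited to the three symbols seen in this dataset, and the helper name is hypothetical.

```python
CURRENCY_SYMBOLS = "$£€"  # assumed set; a real detector may cover more


def currency_position(value: str):
    """Return 'start', 'end', or None depending on where a symbol sits."""
    if not value:
        return None
    if value[0] in CURRENCY_SYMBOLS:
        return "start"
    if value[-1] in CURRENCY_SYMBOLS:
        return "end"
    return None


positions = [currency_position(v) for v in ["£174,199.87", "101,638$", "112877.67"]]
print(positions)
```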

##### 12. **has_text** -> Numbers containing text

In [19]:
diagnosis['Salary']['has_text']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
59,13.515496581453446,"USD111,650.31",,,98.97,-2kg,2439.8053177914226,7G,284,"7,37e+04"
80,unknown,approx 1000,6113.738800120347,14401,47,11$,"USD1,137.55",approx 1000,four,3308906
116,error,15O656,5.49e+03,16964,5471,missing,657.0642569987274,-60.629991444131136,,one
128,64.92559276385012,approx 1000,"1,660$",missing,77.23 kg,13.44 C,error,,three,44372.86947464095
157,37.30,approx 1000,2.83e+03,143.09249712048197,,two,"3,32e+03",,?,"€99,167.87"
...,...,...,...,...,...,...,...,...,...,...
99876,three,XX940,67P8,19044,51.485878574281934,-707,3557.5681292097497,11.040612809705896,missing,missing
99917,44.99271101319163,approx 1000,380165,140.4200740287813,135,,154.09,?,2.3625033484864897,unknown
99952,four,approx 1000,?,185.46537031537326,,-14.319478982819899,"4,25e+03",five,4.66,51951
99973,25.70,A43A05,9869.054693635251,190.25 cm,,error,4354,1956,2.4983797333325866,"36,870#"


We can see:

- Rows containing numbers with text in the selected column.
- Presence of alphabetical currencies like USD in numbers. E.g: ``(USD111,650.31)``
- Presence of text like **'approx'** in numbers. E.g: ``(approx 1000)``
- Presence of random text in numbers. E.g: ``(116c48)``

##### 13. **has_multiple_commas** -> Numbers containing more than one comma

In [21]:
diagnosis['Salary']['has_multiple_commas']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
39,85.39119238523617,11150069,430039,156.51,113.42847314651515,8#,280.50,63.23,3.5016909340924096,86593.9885648507
214,?,3514026,2114,,five,17.16526560551941,4152.84,,,1513.1369537517814
400,five,17694427,,188.13323368825326,106.20531559751271,,266.02,,2.393853010731272,
442,68,13060942,2284.15,204.64,127.66946046843346,7.622790296831802,one,5.10e+01,3$,"6,39e+04"
506,2209,7241262,2094,1.91e+02,,?,328045,?,3.8286487554519573,7646
...,...,...,...,...,...,...,...,...,...,...
99625,error,8618236,"6,276#",194,141.39,45,"3,549#",46.86,2cm,"14,728#"
99719,0,8433953,,198.34,41,027,1.03e+03,95.9725406926238,1.30,missing
99762,67,2216894,,165.87,,44.24835973546875,"3,75e+03",13,1.5038496235709644,8.45e+04
99959,,10297163,296g,179.95,,-1cm,,,4.59e+00,?


We can see:

- Rows containing multiple commas in numbers, in the selected column.
- Presence of multiple commas acting as both **thousand separators** and **decimal replacements** in a single value. E.g: ``(176,944,27)``
- Multiple commas are valid only when every comma is a thousands separator, not when one of them replaces the decimal point.

Usually, valid numbers with multiple commas are represented like this -> ``1,000,555.23``
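
The difference between the two examples can be checked with a pattern that requires exactly three digits after every comma. This is a sketch under that assumption, with a hypothetical helper name, and is not the rule datalabx applies.

```python
import re

# Groups of three digits after each comma, then an optional decimal part.
GROUPED = re.compile(r"[+-]?\d{1,3}(?:,\d{3})+(?:\.\d+)?")


def is_valid_grouping(value: str) -> bool:
    """True when every comma is followed by exactly three digits."""
    return GROUPED.fullmatch(value) is not None


checks = {v: is_valid_grouping(v) for v in ["1,000,555.23", "176,944,27"]}
print(checks)
```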

##### 14. **has_multiple_decimals** -> Numbers containing more than one decimal point.

In [22]:
diagnosis['Salary']['has_multiple_decimals']

Unnamed: 0_level_0,Age,Salary,Expenses,Height_cm,Weight_kg,Temperature_C,Purchase_Amount,Score,Rating,Debt
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
36,93.01,35.343.06,"5,657kg",?,69.16754859811246,-11,3962.8835885375865,2978,1,23757
192,57.88,149.918.17,"7,825$",186#,40.09 kg,-8.571189561945724,?,4,missing,91852.42585796413
827,?,190.714.89,6164.42,150.21944136513636,138,0.90,20170,68.34592378360729,2$,75871.06754567151
992,2.41e+01,163.835.46,5312.81,1.65e+02,65.51 kg,26,875.5619021664696,81kg,1,"45,966$"
1169,4346,138.876.05,1854.6488482739023,160,140kg,39,"1,453kg",83.39,,?
...,...,...,...,...,...,...,...,...,...,...
99009,3,47.337.78,307785,156,10626,N1,error,,three,4.55e+04
99275,error,28.239.73,one,191.34703054588581,75cm,,1815.657564529909,65.58,3.0114260441236333,47563.28
99560,,115.709.27,2356.778369868802,152.19,69.74,14.02,409.14476690756743,five,unknown,"€96,256.54"
99609,73.40236686371765,153.300.12,,,three,4858,2911,27cm,4cm,


We can see:

- Rows containing multiple decimal points in the selected column.
- Presence of 2 decimal points in a single number. E.g: ``(138.876.05)``

Multiple decimal points are usually invalid unless the domain specifically requires them.
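
A short plain-Python sketch shows why such values need attention before conversion: the built-in `float` rejects a string with two decimal points. The helper name `try_float` is hypothetical.

```python
def try_float(value: str):
    """Return float(value), or None when Python cannot parse it."""
    try:
        return float(value)
    except ValueError:
        return None


samples = ["138.876.05", "138876.05"]
decimal_counts = [s.count(".") for s in samples]
parsed = [try_float(s) for s in samples]
print(decimal_counts, parsed)
```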

##### **Valid Multiple Decimal Examples**:

**IP Address:** ``192.168.0.1``


Great!

We now know how to diagnose dirty numerical data using the ``diagnose_numbers()`` method of the ``DirtyDataDiagnosis`` class.

We have explored:

1. **Valid Numbers**
2. **Dirty Numbers**
3. **Only Text**
4. **Only Symbols**
5. **Only Missing**
6. **Only Scientific Notations**
7. **Numbers containing spaces**
8. **Numbers containing text**
9. **Numbers containing symbols**
10. **Numbers containing multiple decimals**
11. **Numbers containing single commas**
12. **Numbers containing multiple commas**
13. **Numbers containing units**
14. **Numbers containing currency symbols**