# Compare LogisticRegression for Aggregated Fields vs. Non-Aggregated Fields

In [507]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

## Aggregated Fields

In [508]:
# Set up the data so that Parts is easy to learn (Tire, Axle, Wheel vs. Seat, Lock, Latch),
# while the Issue text is more or less unrelated to the label.
# The goal is to show that no matter how easy the Parts data is to learn,
# duplicates in the Issue column will likely produce far more extreme, biased weights.
a = pd.DataFrame([
    ["Belt not tight", "Tire,Axle,Wheel", 1],
    ["Air bag", "Seat,Lock,Latch", 0],
    ["Power train is too strong and motorized", "Tire,Axle,Wheel", 1],
    ["Overheating in the motor", "Seat,Lock,Latch", 0],
], columns=["Issue", "Parts", "IS_COMPLAINT"])
a

Unnamed: 0,Issue,Parts,IS_COMPLAINT
0,Belt not tight,"Tire,Axle,Wheel",1
1,Air bag,"Seat,Lock,Latch",0
2,Power train is too strong and motorized,"Tire,Axle,Wheel",1
3,Overheating in the motor,"Seat,Lock,Latch",0


In [509]:
clf_agg = TfidfVectorizer()
combine_text = a["Issue"] + " " + a["Parts"]
issue_frame = clf_agg.fit_transform(combine_text)
label = a["IS_COMPLAINT"]
lr_agg = LogisticRegression(random_state=42)
lr_agg.fit(
    issue_frame, label
)

LogisticRegression(random_state=42)

## Non-Aggregated Fields

In [510]:
b = a.copy()
b["Parts"] = b["Parts"].str.split(",")
b = b.explode("Parts")
b

Unnamed: 0,Issue,Parts,IS_COMPLAINT
0,Belt not tight,Tire,1
0,Belt not tight,Axle,1
0,Belt not tight,Wheel,1
1,Air bag,Seat,0
1,Air bag,Lock,0
1,Air bag,Latch,0
2,Power train is too strong and motorized,Tire,1
2,Power train is too strong and motorized,Axle,1
2,Power train is too strong and motorized,Wheel,1
3,Overheating in the motor,Seat,0
3,Overheating in the motor,Lock,0
3,Overheating in the motor,Latch,0


In [511]:
combine_text = b["Issue"] + " " + b["Parts"]
clf_non_agg = TfidfVectorizer()
clf_non_agg.fit(combine_text)
issue_frame = clf_non_agg.transform(combine_text)
label = b["IS_COMPLAINT"]
lr_non_agg = LogisticRegression(random_state=42)
lr_non_agg.fit(
    issue_frame, label
)


LogisticRegression(random_state=42)

In [512]:
unique_issue = b["Issue"].str.lower().str.split().explode().unique()
unique_parts = b["Parts"].str.lower().str.split().explode().unique()

In [513]:
# Aggregated LR Weights
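# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0).
df_aggregated = pd.DataFrame(
    lr_agg.coef_,
    columns=clf_agg.get_feature_names_out()
)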

In [514]:
# Non-Aggregated LR Weights
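# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0).
df_non_aggregate = pd.DataFrame(
    lr_non_agg.coef_,
    columns=clf_non_agg.get_feature_names_out()
)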

## Compare Aggregated and Non-Aggregated Weights

In [515]:
df_aggregated[unique_parts]

Unnamed: 0,tire,axle,wheel,seat,lock,latch
0,0.234362,0.234362,0.234362,-0.273653,-0.273653,-0.273653


In [516]:
df_non_aggregate[unique_parts]

Unnamed: 0,tire,axle,wheel,seat,lock,latch
0,0.287286,0.287286,0.287286,-0.341084,-0.341084,-0.341084


In both models, Tire, Axle, and Wheel raise the log-odds of a complaint, while Seat, Lock, and Latch lower them.
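
Because logistic regression weights live on the log-odds scale, exponentiating a weight gives the multiplicative change in odds per unit increase of that feature. A minimal sketch using two of the aggregated-model weights printed above. The variable names are illustrative, and since the features are TF-IDF scores, a "unit" is one unit of TF-IDF score rather than one occurrence of the word:

```python
import math

# Aggregated-model weights copied from the output above.
w_tire = 0.234362   # positive weight: pushes toward IS_COMPLAINT = 1
w_seat = -0.273653  # negative weight: pushes toward IS_COMPLAINT = 0

# exp(weight) is the factor by which the odds are multiplied
# when the feature's TF-IDF score increases by one unit.
odds_ratio_tire = math.exp(w_tire)
odds_ratio_seat = math.exp(w_seat)

print(f"tire: odds multiplied by {odds_ratio_tire:.3f}")
print(f"seat: odds multiplied by {odds_ratio_seat:.3f}")
```

A factor above 1 raises the odds of a complaint; a factor below 1 lowers them.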

In [517]:
a

Unnamed: 0,Issue,Parts,IS_COMPLAINT
0,Belt not tight,"Tire,Axle,Wheel",1
1,Air bag,"Seat,Lock,Latch",0
2,Power train is too strong and motorized,"Tire,Axle,Wheel",1
3,Overheating in the motor,"Seat,Lock,Latch",0


This suggests that both models learn similar weights for the component parts, regardless of whether the Issue text is duplicated. Logistic regression is trained on a loss **accumulated over all training rows**, and each part appears in the same number of rows (two) in both data formats, so the learned part weights are similar.
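
The row counts behind this argument can be checked directly. The sketch below is dependency-free and rebuilds the toy data by hand rather than reusing `a` and `b`; the helper `token_row_counts` is a hypothetical name introduced only for this illustration. It counts, for each token, how many training rows contain it in each format:

```python
from collections import Counter

rows = [
    ("Belt not tight", "Tire,Axle,Wheel"),
    ("Air bag", "Seat,Lock,Latch"),
    ("Power train is too strong and motorized", "Tire,Axle,Wheel"),
    ("Overheating in the motor", "Seat,Lock,Latch"),
]

# Aggregated: one row per issue, all parts kept together.
aggregated = [(issue, parts.replace(",", " ")) for issue, parts in rows]
# Non-aggregated: one row per (issue, part) pair, mirroring DataFrame.explode.
exploded = [(issue, part) for issue, parts in rows for part in parts.split(",")]

def token_row_counts(records, field):
    """Count how many rows contain each lowercase token of the given field."""
    counts = Counter()
    for record in records:
        counts.update(set(record[field].lower().split()))
    return counts

part_counts_agg = token_row_counts(aggregated, 1)
part_counts_exp = token_row_counts(exploded, 1)
issue_counts_agg = token_row_counts(aggregated, 0)
issue_counts_exp = token_row_counts(exploded, 0)

print("parts, aggregated:    ", dict(part_counts_agg))
print("parts, non-aggregated:", dict(part_counts_exp))
print("'belt' rows:", issue_counts_agg["belt"], "vs", issue_counts_exp["belt"])
```

Each part token occurs in the same number of rows in both formats, while every issue token occurs in three times as many rows once the data is exploded.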

In [518]:
(df_aggregated[unique_parts] - df_non_aggregate[unique_parts]).T.describe()

Unnamed: 0,0
count,6.0
mean,0.007254
std,0.065921
min,-0.052924
25%,-0.052924
50%,0.007254
75%,0.067431
max,0.067431


In [519]:
df_aggregated[unique_issue]

Unnamed: 0,belt,not,tight,air,bag,power,train,is,too,strong,and,motorized,overheating,in,the,motor
0,0.170761,0.170761,0.170761,-0.191577,-0.191577,0.126498,0.126498,0.126498,0.126498,0.126498,0.126498,0.126498,-0.155517,-0.155517,-0.155517,-0.155517


In [520]:
df_non_aggregate[unique_issue]

Unnamed: 0,belt,not,tight,air,bag,power,train,is,too,strong,and,motorized,overheating,in,the,motor
0,0.450449,0.450449,0.450449,-0.515165,-0.515165,0.310878,0.310878,0.310878,0.310878,0.310878,0.310878,0.310878,-0.388731,-0.388731,-0.388731,-0.388731


In [521]:
(df_aggregated[unique_issue] - df_non_aggregate[unique_issue]).T.describe()

Unnamed: 0,0
count,16.0
mean,-0.034356
std,0.242314
min,-0.279689
25%,-0.18438
50%,-0.18438
75%,0.233214
max,0.323588


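For the duplicated Issue column, the standard deviation of the weight differences is much higher (0.242, versus 0.066 for Parts). The non-aggregated format trains on the same issue text multiple times, once per part, and develops much more extreme weights for specific terms: `belt`, for example, goes from 0.171 to 0.450.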
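
One way to see why repetition inflates weights: in scikit-learn's L2-penalized logistic regression, the data term is traded off against the penalty, so repeating a row three times acts like giving it a sample weight of 3, which weakens the effective regularization. The sketch below isolates that effect under simplifying assumptions: it uses only the Issue text, fits the vectorizer once, and repeats whole rows instead of exploding Parts. It is an illustration, not a reproduction of the models above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

issues = [
    "Belt not tight",
    "Air bag",
    "Power train is too strong and motorized",
    "Overheating in the motor",
]
labels = np.array([1, 0, 1, 0])

# Fit the vectorizer once so all three models see identical features.
X = TfidfVectorizer().fit_transform(issues).toarray()

# 1) Each unique row seen once.
lr_once = LogisticRegression(random_state=42).fit(X, labels)

# 2) Each row repeated three times, as explode() does to the Issue text.
repeat_idx = np.repeat(np.arange(len(issues)), 3)
lr_repeated = LogisticRegression(random_state=42).fit(X[repeat_idx], labels[repeat_idx])

# 3) Unique rows again, but each with a sample weight of 3.
lr_weighted = LogisticRegression(random_state=42).fit(
    X, labels, sample_weight=np.full(len(issues), 3.0)
)

norm_once = np.linalg.norm(lr_once.coef_)
norm_repeated = np.linalg.norm(lr_repeated.coef_)
norm_weighted = np.linalg.norm(lr_weighted.coef_)

print(f"weight norm, rows seen once:    {norm_once:.3f}")
print(f"weight norm, rows repeated x3:  {norm_repeated:.3f}")
print(f"weight norm, sample_weight = 3: {norm_weighted:.3f}")
```

If the repeated and weighted fits agree, as expected, then down-weighting exploded rows (for instance, `sample_weight` of one over the number of parts per issue) is one possible way to counter the inflation without changing the data format.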

These weights are likely to become far more extreme in our logistic regression on the recall data, since duplication is present in nearly 80% of that dataset.

That said, if you want to predict whether an entry is a recall or a complaint and want to exploit the component parts column, feature-based, non-linear algorithms such as neural networks and random forests will likely learn from parts much better.