## Drop the Fewest Columns Above a Correlation Threshold

Dropping every column whose correlation with another column exceeds a threshold is easy. The tricky part is the word 'least': we want to drop as few columns as possible.
Consider four columns: c1, c2, c3, c4.
Let's say these particular pairs have a correlation above the threshold:


*   (c1 c2)
*   (c2 c4)
*   (c2 c3)
*   (c3 c4)


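Now, if we simply drop every column that appears in any of the pairs above, all four columns go, including c1, which is innocent: it is correlated only with c2, so dropping c2 alone would already resolve that pair.
So the algorithm below adds one extra filter: for each pair, it drops only the column that is more strongly correlated with the rest of the dataset overall.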

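To make the four-column scenario concrete, here is a minimal sketch. The correlation values are hand-written for illustration (they are not computed from any data and were only chosen so that exactly the four pairs listed above exceed 0.85), and the names `naive` and `filtered` exist only in this example.

```python
import pandas as pd

cols = ["c1", "c2", "c3", "c4"]
# Hypothetical correlation matrix, written by hand for illustration only.
corr = pd.DataFrame(
    [[1.00, 0.86, 0.60, 0.65],
     [0.86, 1.00, 0.90, 0.90],
     [0.60, 0.90, 1.00, 0.95],
     [0.65, 0.90, 0.95, 1.00]],
    index=cols, columns=cols,
)
thresh = 0.85

# Every pair of distinct columns whose absolute correlation exceeds the threshold.
pairs = [(a, b)
         for i, a in enumerate(cols)
         for b in cols[i + 1:]
         if abs(corr.loc[a, b]) > thresh]

# Naive approach: drop every column that appears in any pair.
naive = {c for pair in pairs for c in pair}

# Extra filter: per pair, drop only the column with the larger total |correlation|.
totals = corr.abs().sum()
filtered = {a if totals[a] > totals[b] else b for a, b in pairs}

print(sorted(naive))     # ['c1', 'c2', 'c3', 'c4']
print(sorted(filtered))  # ['c2', 'c4']
```

With these numbers the naive approach discards all four columns, while the extra comparison discards only c2 and c4 and keeps c1 and c3.
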
In [6]:
def drop_above_corr_thresh(df, thresh=0.85):
    """Return the columns to drop so that no remaining pair of columns
    has an absolute correlation above `thresh`."""
    # Restrict to numeric columns: recent pandas versions raise an error
    # if corr() is called on a frame that still contains text columns.
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns.to_list()
    cols_to_drop = set()     # columns that will be returned to the user to drop
    above_thresh_pairs = []  # pairs with correlation above the specified threshold

    for i in range(len(cols)):
        # Start at i + 1 so each pair of distinct columns is visited exactly once.
        for j in range(i + 1, len(cols)):
            a, b = cols[i], cols[j]
            if abs(corr.loc[a, b]) > thresh:
                above_thresh_pairs.append((a, b))

    # For each above-threshold pair, compare the two columns' total absolute
    # correlation with every feature in the dataset. The column with the
    # greater total is added to cols_to_drop and ultimately dropped.
    for first, second in above_thresh_pairs:
        first_total = abs(corr[first]).sum()
        second_total = abs(corr[second]).sum()
        if first_total > second_total:
            cols_to_drop.add(first)
        else:
            cols_to_drop.add(second)  # on a tie, the second column of the pair is dropped
    return list(cols_to_drop)

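The nested loops in `drop_above_corr_thresh` could also be written with a NumPy mask over the upper triangle of the correlation matrix. The sketch below is one possible variant, not the notebook's implementation: `high_corr_pairs` and the synthetic `toy` frame are hypothetical names introduced here, and the same "larger total absolute correlation is dropped" rule is applied afterwards, followed by a check that no remaining pair exceeds the threshold.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corr, thresh=0.85):
    """Return (col_a, col_b) pairs whose absolute correlation exceeds thresh."""
    abs_corr = corr.abs().to_numpy()
    # k=1 keeps only entries strictly above the diagonal, so each pair appears once.
    upper = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
    rows, col_idx = np.where(upper & (abs_corr > thresh))
    names = corr.columns
    return [(names[i], names[j]) for i, j in zip(rows, col_idx)]

# Synthetic data (an assumption for this example): b is a noisy copy of a, d of c.
rng = np.random.default_rng(0)
x, y = rng.normal(size=500), rng.normal(size=500)
toy = pd.DataFrame({
    "a": x,
    "b": x + 0.1 * rng.normal(size=500),
    "c": y,
    "d": y + 0.1 * rng.normal(size=500),
})

corr = toy.corr()
pairs = high_corr_pairs(corr)

# Same rule as above: per pair, drop the column with the larger total |correlation|.
totals = corr.abs().sum()
to_drop = {a if totals[a] > totals[b] else b for a, b in pairs}

# Apply the result and confirm that no remaining pair is above the threshold.
reduced = toy.drop(columns=list(to_drop))
remaining = reduced.corr().abs().to_numpy()
max_remaining = remaining[np.triu_indices_from(remaining, k=1)].max()
print(pairs, sorted(to_drop), max_remaining <= 0.85)
```

The mask avoids the Python-level double loop, which may matter when the frame has many columns; for a few dozen columns either version should be fine.
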
In [7]:
q2=pd.read_csv("/content/drive/My Drive/summer-products-with-rating-and-performance_2020-08.csv")
q2=pd.concat([q2]*50)

In [8]:
q2.shape

(78650, 43)

In [9]:
q2.head(1)

Unnamed: 0,title,title_orig,price,retail_price,currency_buyer,units_sold,uses_ad_boosts,rating,rating_count,rating_five_count,rating_four_count,rating_three_count,rating_two_count,rating_one_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,tags,product_color,product_variation_size_id,product_variation_inventory,shipping_option_name,shipping_option_price,shipping_is_express,countries_shipped_to,inventory_total,has_urgency_banner,urgency_text,origin_country,merchant_title,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_id,merchant_has_profile_picture,merchant_profile_picture,product_url,product_picture,product_id,theme,crawl_month
0,2020 Summer Vintage Flamingo Print Pajamas Se...,2020 Summer Vintage Flamingo Print Pajamas Se...,16.0,14,EUR,100,0,3.76,54,26.0,8.0,10.0,1.0,9.0,0,0,0,0,"Summer,Fashion,womenunderwearsuit,printedpajam...",white,M,50,Livraison standard,4,0,34,50,1.0,Quantité limitée !,CN,zgrdejia,zgrdejia,(568 notes),568,4.128521,595097d6a26f6e070cb878d1,0,,https://www.wish.com/c/5e9ae51d43d6a96e303acdb0,https://contestimg.wish.com/api/webimage/5e9ae...,5e9ae51d43d6a96e303acdb0,summer,2020-08


In [10]:
drop_above_corr_thresh(q2)

['rating_four_count',
 'shipping_option_price',
 'rating_two_count',
 'rating_three_count',
 'rating_five_count',
 'rating_count']