## **Introduction**

The adult.data file contains over 30,000 rows of demographic information, and some of them can be combined into groups. For example, one person has 'native-country' = 'United-States' and 'capital-gain' = '0'. These attributes can lead to strategic data-driven decisions. Taking the first 10 lines of data as input, we can find the following patterns that identify the groups with no capital gain or loss:

- There are 7 people with {capital-gain=None, capital-loss=None}
- There are 5 people with {native-country=United-States, capital-gain=None, capital-loss=None}
- There are 5 people with {native-country=United-States, capital-gain=None}
- There are 8 people with {native-country=United-States, capital-loss=None}

The *support* of a set of attributes is defined as the ratio of "total number of records with the given attributes" to "total number of records in the dataset". For example:

- The support of {capital-gain=None, capital-loss=None} is 7/10 = 0.7

We can now derive some rules **X=>Y**, where X and Y are sets of attributes. The *confidence* of the rule is defined as the ratio of "total number of records with all the attributes in X and Y" to "total number of records with the attributes in X", i.e., the ratio of "support of X U Y" to "support of X". For example:

- The confidence of the rule {native-country=United-States, capital-gain=None} => {capital-loss=None} is 0.5/0.5 = 1.0
- The confidence of the rule {capital-gain=None, capital-loss=None} => {native-country=United-States} is 0.5/0.7 = 0.71
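As a minimal sketch of the two definitions above, the following Python computes support and confidence over ten records. The records are hypothetical: they are constructed only to match the counts quoted in the bullets (7, 5, 5 and 8), and the helper names `support` and `confidence` are illustrative, not part of the task.

```python
# Ten illustrative records (hypothetical; built to match the counts quoted above).
US, OTHER = "native-country=United-States", "native-country=other"
GAIN0, GAIN = "capital-gain=None", "capital-gain=other"
LOSS0 = "capital-loss=None"

records = [
    {US, GAIN, LOSS0},
    {US, GAIN0, LOSS0},
    {US, GAIN0, LOSS0},
    {US, GAIN0, LOSS0},
    {OTHER, GAIN0, LOSS0},
    {US, GAIN0, LOSS0},
    {OTHER, GAIN0, LOSS0},
    {US, GAIN0, LOSS0},
    {US, GAIN, LOSS0},
    {US, GAIN, LOSS0},
]


def support(records, itemset):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for record in records if itemset <= record) / len(records)


def confidence(records, x, y):
    """support(X U Y) / support(X) for the rule X => Y."""
    return support(records, set(x) | set(y)) / support(records, x)


print(support(records, {GAIN0, LOSS0}))                      # 0.7
print(round(confidence(records, {US, GAIN0}, {LOSS0}), 2))   # 1.0
print(round(confidence(records, {GAIN0, LOSS0}, {US}), 2))   # 0.71
```

Representing each record as a set keeps the "contains every item" check to a single subset test (`itemset <= record`).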



### **Task**
Rearrange the given set of rules X=>Y in descending order of confidence. It is guaranteed that no two rules have the same confidence. Also, the support of the attribute sets X and Y in each of the rules is greater than or equal to 0.3.

### **Function Description**
Complete the *arrangingRules* function. The function must return a string of the rules in descending order of confidence.

*arrangingRules* has the following parameter:
- rules: an array of rule strings

###  **Sample Input**
3\
{native-country=United-States,capital-gain=None}=>{capital-loss=None} \
{capital-gain=None,capital-loss=None}=>{native-country=United-States} \
{native-country=United-States,capital-loss=None}=>{capital-gain=None}

### **Sample Output**
{native-country=United-States,capital-gain=None}=>{capital-loss=None}\
{native-country=United-States,capital-loss=None}=>{capital-gain=None}\
{capital-gain=None,capital-loss=None}=>{native-country=United-States}

### **Explanation**
- The confidence of {native-country=United-States,capital-gain=None}=>{capital-loss=None} is 0.94
- The confidence of {native-country=United-States,capital-loss=None}=>{capital-gain=None} is 0.9098
- The confidence of {capital-gain=None,capital-loss=None}=>{native-country=United-States} is 0.9091

### Import Data

In [7]:
!pip install -q wget
import wget
!wget -q "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

In [8]:
!ls

adult.data  adult.data.1  adult.data.2	sample_data
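The listing above contains `adult.data.1` and `adult.data.2`, which suggests the download cell was run more than once and each run saved a new copy. A minimal sketch of a download that skips the fetch when the file already exists is shown below, using the standard library's `urllib.request.urlretrieve`. The name `download_once` and its `fetch` parameter are hypothetical helpers introduced for illustration.

```python
import os
from urllib.request import urlretrieve

DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"


def download_once(url, path, fetch=urlretrieve):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    `fetch` is injectable so the function can be exercised without network access.
    """
    if os.path.exists(path):
        return False
    fetch(url, path)
    return True


# Usage (needs network access, so it is left commented out here):
# download_once(DATA_URL, "adult.data")
```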


In [12]:
import pandas as pd
import string
!pip install -q apyori
from apyori import apriori
names = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income'
]
data = pd.read_csv('adult.data', names=names)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
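The later cells compare against `' United-States'` with a leading space and call `strip()` on every value, which is consistent with the raw file separating fields with a comma followed by a space. As a sketch under that assumption, the standard library `csv` reader can drop the space at parse time with `skipinitialspace=True`; `pandas.read_csv` accepts a parameter of the same name. The two lines below are written in the assumed raw layout for rows 0 and 4 of the table above.

```python
import csv
import io

# Two lines in the assumed raw layout (comma followed by a space).
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, "
    "Wife, Black, Female, 0, 0, 40, Cuba, <=50K\n"
)

plain = list(csv.reader(io.StringIO(raw)))
trimmed = list(csv.reader(io.StringIO(raw), skipinitialspace=True))

print(repr(plain[0][13]))    # ' United-States' (leading space kept)
print(repr(trimmed[0][13]))  # 'United-States' (leading space dropped)
```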


### Take 10 lines as the training dataset


In [13]:
# Work on a copy so the assignments below do not write into a slice of `data`
df=data[['native-country','capital-gain','capital-loss']].copy()

def checkone(num):
    # Map a capital value to a category: 0 -> 'None', anything else -> 'other'
    if num==0:
        return 'None'
    else:
        return 'other'

### Select native-country, capital-gain, capital-loss as sample attributes.


In [14]:

# capital columns
df.iloc[:,1:3]=df.iloc[:,1:3].applymap(checkone)
# country column (the raw values keep a leading space, hence ' United-States')
df.loc[df['native-country']!=' United-States','native-country']='other'
#df=df.head(10)
df


Unnamed: 0,native-country,capital-gain,capital-loss
0,United-States,other,None
1,United-States,None,None
2,United-States,None,None
3,United-States,None,None
4,other,None,None
...,...,...,...
32556,United-States,None,None
32557,United-States,None,None
32558,United-States,None,None
32559,United-States,None,None


### Format each record as attribute=value.

In [15]:
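records = []
# One list of 'column=value' strings per row; strip() removes the leading space in ' United-States'
for i in range(0, len(df)):
    records.append([df.columns[j]+'='+str(df.values[i,j]).strip()  for j in range(0, len(df.columns))])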
records[0]

['native-country=United-States', 'capital-gain=other', 'capital-loss=None']
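The encoding and formatting steps above can be sketched for a single row without pandas. The functions `encode_capital` and `to_record` are hypothetical names used only for this illustration; the logic mirrors `checkone`, the country replacement, and the `attribute=value` formatting from the cells above.

```python
def encode_capital(value):
    """Mirror of checkone: 0 -> 'None', anything else -> 'other'."""
    return "None" if value == 0 else "other"


def to_record(country, gain, loss):
    """Build the 'attribute=value' items for one row."""
    country = country.strip()
    if country != "United-States":
        country = "other"
    return [
        "native-country=" + country,
        "capital-gain=" + encode_capital(gain),
        "capital-loss=" + encode_capital(loss),
    ]


# Row 0 of the dataset: United-States, capital-gain 2174, capital-loss 0
print(to_record(" United-States", 2174, 0))
# Row 4 of the dataset: Cuba, capital-gain 0, capital-loss 0
print(to_record(" Cuba", 0, 0))
```

The first call reproduces the `records[0]` output shown above.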

### Create support value table via apriori function.



In [16]:
# Here we choose the max_length of 3, while in real data we should not limit the max_length.
association_rules = apriori(records, min_support = 0.3, min_confidence = 0.2, min_lift = 1, min_length = 2, max_length = 3)
association_results = list(association_rules)
association_results[0]

RelationRecord(items=frozenset({'capital-gain=None'}), support=0.9167101747489328, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'capital-gain=None'}), confidence=0.9167101747489328, lift=1.0)])

In [17]:
results=[]

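for item in association_results[:]:
    
    value0=set(item[0])   # items of the itemset
    value1=str(item[1])   # support of the itemset (stored as text, converted back with float() later)
    
    rows=(value0,value1)
    results.append(rows)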

labels=['Set','Support']

census=pd.DataFrame.from_records(results,columns=labels)

census

Unnamed: 0,Set,Support
0,{capital-gain=None},0.9167101747489328
1,{capital-loss=None},0.9533490986149074
2,{native-country=United-States},0.895857006848684
3,"{capital-loss=None, capital-gain=None}",0.8700592733638401
4,"{capital-gain=None, native-country=United-States}",0.8199686741807684
5,"{capital-loss=None, native-country=United-States}",0.8535057277110654
6,"{capital-loss=None, capital-gain=None, native-...",0.7776173950431498
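An alternative to the DataFrame lookup, sketched here under assumptions, is a plain dictionary keyed by `frozenset` with the support kept as a float. `Rec` is a hypothetical stand-in for apyori's `RelationRecord`, mirroring only the `items` and `support` fields visible in the output above; the three supports are copied from the table.

```python
from collections import namedtuple

# Hypothetical stand-in for apyori's RelationRecord (only the two fields used here).
Rec = namedtuple("Rec", ["items", "support"])

results = [
    Rec(frozenset({"capital-gain=None"}), 0.9167101747489328),
    Rec(frozenset({"capital-loss=None", "capital-gain=None"}), 0.8700592733638401),
    Rec(frozenset({"capital-gain=None", "native-country=United-States"}), 0.8199686741807684),
]

# frozenset keys are hashable and ignore item order, so any ordering of the
# same items finds the same entry.
support_of = {rec.items: rec.support for rec in results}

key = frozenset({"capital-gain=None", "capital-loss=None"})
print(support_of[key])  # 0.8700592733638401
```

With this layout a lookup is a single dictionary access, and no string-to-float conversion is needed when computing confidence.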


### Set sample input

In [19]:
#test=input()

rules='3 {native-country=United-States,capital-gain=None}=>{capital-loss=None} {capital-gain=None,capital-loss=None}=>{native-country=United-States} {native-country=United-States,capital-loss=None}=>{capital-gain=None}'
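The input string packs the rule count and the rules into one whitespace-separated line. A minimal parsing sketch follows; `parse_input` and `parse_rule` are hypothetical helper names. Splitting on `}=>{` rather than on `=>` is a design choice made here so that an item whose value begins with `>` (for example an item like `income=>50K`) would not be mistaken for the rule arrow.

```python
def parse_input(text):
    """Split the single-string input into rule strings, checking the leading count."""
    count, *rule_strings = text.split()
    if int(count) != len(rule_strings):
        raise ValueError("expected %s rules, found %d" % (count, len(rule_strings)))
    return rule_strings


def to_items(side):
    """Turn '{a,b' or 'c}' into a frozenset of items."""
    return frozenset(side.strip().strip("{}").split(","))


def parse_rule(rule):
    """Split '{a,b}=>{c}' into (X, Y) as frozensets."""
    left, right = rule.split("}=>{")
    return to_items(left), to_items(right)


sample = ("3 {native-country=United-States,capital-gain=None}=>{capital-loss=None} "
          "{capital-gain=None,capital-loss=None}=>{native-country=United-States} "
          "{native-country=United-States,capital-loss=None}=>{capital-gain=None}")

for rule in parse_input(sample):
    x, y = parse_rule(rule)
    print(sorted(x), "=>", sorted(y))
```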


### Creating temporary function

In [20]:
def arrangingRules(rules):
  length=int(rules.split()[0])
  rule_list=rules.split()[1:]
    
  unsort_rules_list=[]

  
  for item in rule_list:
    # Strip the surrounding braces, then split each side into its items
    X=set(item.split('=>')[0].strip(string.punctuation).split(','))
    Y=set(item.split('=>')[1].strip(string.punctuation).split(','))
        
    # confidence(X=>Y) = support(X U Y) / support(X), looked up in the census table
    support_xy=float(census.loc[census['Set']==X.union(Y),'Support'].iloc[0])
    support_x=float(census.loc[census['Set']==X,'Support'].iloc[0])
        
    confidence=support_xy/support_x
    
    rows=(item,confidence)
        
    unsort_rules_list.append(rows)
  labels=['X=>Y','Confidence']

  # Store unsorted rules
  unsort_rules=pd.DataFrame.from_records(unsort_rules_list,columns=labels)
  
  # Print out final sorted rules
  print(unsort_rules.sort_values('Confidence',ascending=False).iloc[:,0].to_string(index=False))
    

### Check function result.

In [21]:
df=arrangingRules(rules)
df

 {native-country=United-States,capital-gain=Non...
 {native-country=United-States,capital-loss=Non...
 {capital-gain=None,capital-loss=None}=>{native...


We get the same order as in the sample output. The format also fits the requirement.
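The Function Description asks for a returned string, whereas the function above prints its result. A minimal sketch of a returning variant is given below. It is not the notebook's implementation: `arrange_rules` is a hypothetical name, the support table is passed in as a dictionary, and the supports are copied from the table above. The last row of that table is truncated in the output, so it is assumed here to be the three-item set.

```python
US = "native-country=United-States"
GAIN0 = "capital-gain=None"
LOSS0 = "capital-loss=None"

# Supports copied from the table above. The last table row is truncated in the
# output; it is ASSUMED here to be the three-item set.
SUPPORT = {
    frozenset({GAIN0, LOSS0}): 0.8700592733638401,
    frozenset({GAIN0, US}): 0.8199686741807684,
    frozenset({LOSS0, US}): 0.8535057277110654,
    frozenset({GAIN0, LOSS0, US}): 0.7776173950431498,
}


def to_items(side):
    """Turn '{a,b' or 'c}' into a frozenset of items."""
    return frozenset(side.strip("{}").split(","))


def arrange_rules(rule_strings, support):
    """Return the rules joined by newlines, in descending order of confidence."""
    def conf(rule):
        left, right = rule.split("}=>{")
        x, y = to_items(left), to_items(right)
        return support[x | y] / support[x]

    return "\n".join(sorted(rule_strings, key=conf, reverse=True))


sample_rules = [
    "{native-country=United-States,capital-gain=None}=>{capital-loss=None}",
    "{capital-gain=None,capital-loss=None}=>{native-country=United-States}",
    "{native-country=United-States,capital-loss=None}=>{capital-gain=None}",
]
print(arrange_rules(sample_rules, SUPPORT))
```

Under these assumptions the printed order matches the Sample Output, and the full rule strings are returned without the column-width truncation seen in the printed DataFrame.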



# Above is the whole function

### Below are the steps in writing the function

In [24]:
def arrangingRules(rules):
  # Import necessary packages
  import pandas as pd
  import string
  !pip install -q apyori
  from apyori import apriori

  # Processing data
  names = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income',
  ]
  df = pd.read_csv('adult.data', names=names)

  # Format the original data as 'column=value' strings, one list per row
  records = []
  for i in range(0, len(df)):
      records.append([df.columns[j]+'='+str(df.values[i,j]).strip()  for j in range(0, len(df.columns))])

  # Create rules and relative support
  association_rules = apriori(records, min_support = 0.3, min_confidence = 0.2, min_lift = 1, min_length = 2, max_length = 12)
  association_results = list(association_rules)

  # Create support table
  length=int(rules.split()[0])
  rule_list=rules.split()[1:]
    
  unsort_rules_list=[]

  results=[]

  for item in association_results[:]:

    value0=set(item[0])
    value1=str(item[1])
      
    rows=(value0,value1)
    results.append(rows)

  labels=['Set','Support']

  census=pd.DataFrame.from_records(results,columns=labels)

  for item in rule_list:
    # Strip the surrounding braces, then split each side into its items
    X=set(item.split('=>')[0].strip(string.punctuation).split(','))
    Y=set(item.split('=>')[1].strip(string.punctuation).split(','))
        
    # confidence(X=>Y) = support(X U Y) / support(X), looked up in the census table
    support_xy=float(census.loc[census['Set']==X.union(Y),'Support'].iloc[0])
    support_x=float(census.loc[census['Set']==X,'Support'].iloc[0])
        
    confidence=support_xy/support_x
    
    rows=(item,confidence)
        
    unsort_rules_list.append(rows)
  labels=['X=>Y','Confidence']

  # Store unsorted rules
  unsort_rules=pd.DataFrame.from_records(unsort_rules_list,columns=labels)
  
  # Print out final sorted rules
  print(unsort_rules.sort_values('Confidence',ascending=False).iloc[:,0].to_string(index=False))


arrangingRules(rules)

KeyboardInterrupt: ignored
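The source does not say why this run had to be interrupted. One plausible factor, offered as an assumption rather than a diagnosis, is that this version mines all 15 columns with `max_length = 12`, whereas the earlier cells mined 3 encoded columns. The sketch below counts only the possible column combinations; the number of candidate itemsets Apriori actually evaluates depends on the distinct values per column and on `min_support` pruning, so this is a rough indication of growth and not a measurement. `column_combinations` is a hypothetical helper name.

```python
from math import comb


def column_combinations(n_columns, max_length):
    """Number of non-empty column subsets with at most `max_length` columns."""
    upper = min(n_columns, max_length)
    return sum(comb(n_columns, k) for k in range(1, upper + 1))


print(column_combinations(3, 3))    # 7
print(column_combinations(15, 12))  # 32646
```

Separately, this version skips the `checkone` encoding, so a zero capital gain is formatted as `capital-gain=0` rather than `capital-gain=None`, and the sample rules would not find a matching set in the support table.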